Project P5 - Segment the customers of an e-commerce site
OPENCLASSROOMS - Data Scientist track - Adeline Le Ray - 03/2024
Introduction
Olist is a Brazilian company that provides a sales solution for online marketplaces. Olist wants a customer segmentation that its e-commerce teams can use day to day in their communication campaigns.
Approach
1 - SQL queries to build the Dashboard for the Customer Experience teams: this dashboard exposes the key KPIs so that the teams have visibility on the states, cities, or sellers that require close monitoring by our customer service.
2 - Exploratory data analysis: the exploratory analysis deepens the understanding of the dataset and creates new per-customer variables, in order to obtain a per-customer dataset for the segmentation.
3 - Customer segmentation: the goal is to understand the different types of users through their behaviour and their personal data. The proposed segmentation must be actionable and easy for the Marketing team to use. At a minimum, it must distinguish the good customers from the less good ones in terms of orders and satisfaction.
- RFM marketing
- Unsupervised methods
4 - Maintenance contract proposal based on an analysis of segment stability over time: the goal of this notebook is to simulate how the stability of the clustering evolves over different periods and to recommend how often the segmentation should be refreshed to remain relevant.
Data: an anonymised SQLite database containing information on order history, purchased products, satisfaction reviews, and customer locations since January 2017.
Table of contents
Notebook 1 - SQL queries
Notebook 2 - Exploratory analysis
Part 1 - Data import and inspection
Part 2 - Exploratory analysis of the data tables
- Table "customers"
- Table "orders"
- Table "order_items"
- Table "order_pymts"
- Table "order_reviews"
- Table "geoloc"
- Tables "products" and "translation"
- Table "sellers"
- Conclusion of the exploratory analysis of the data tables
Part 3 - Final dataset for the customer segmentation
- Joining the per-customer aggregated tables
- Descriptive analysis of the final dataset
- Correlation between the variables
- Transformation trials to normalise the distributions
- Saving the dataset
Notebook 3 - Clustering trials
Notebook 4 - Maintenance simulation
Part 1 - Data import and inspection
Library imports
import sqlite3
import numpy as np
import pandas as pd
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import plotly.express as px
from skimpy import skim # dataframe inspection
from IPython.display import Markdown # Markdown rendering of outputs
from sklearn.preprocessing import OneHotEncoder # encoder for categorical variables
# Python version
!python --version
# Versions of the libraries used
print('\n'.join(f'{m.__name__} - {m.__version__}'
                for m in globals().values()
                if getattr(m, '__version__', None)))
Python 3.11.4
numpy - 1.24.4
pandas - 2.2.2
seaborn - 0.13.2
folium - 0.15.1
# Default plot parameters
sns.set_style('whitegrid') # darkgrid, whitegrid, dark, white, ticks
plt.rc('axes', titlesize=15) # font size of the axes title
plt.rc('axes', labelsize=14) # font size of the x and y labels
plt.rc('xtick', labelsize=13) # font size of the x tick labels
plt.rc('ytick', labelsize=13) # font size of the y tick labels
plt.rc('legend', fontsize=13) # legend font size
plt.rc('font', size=13) # default text size
width = 7
height = 5
plt.rc('figure', figsize=(width, height)) # default figure size (plt.figure() here would only create an empty figure)
meanprops = {'marker':'o', 'markeredgecolor':'black','markerfacecolor':'firebrick'}
# Display options: show all columns
pd.set_option('display.max_columns', None)
Function definitions
Dataframe inspection
# Dataframe inspection function
def inspection(df, dfname):
    """!
    @brief Performs a thorough inspection of a dataframe.
    This function displays the first rows, the dimensions, the variable types, the missing values,
    identifies duplicates, and shows a statistical description of the dataframe and a sample of the
    categories of the qualitative variables.
    @param df: Dataframe to inspect (pandas DataFrame).
    @param dfname: Name of the dataframe (str).
    """
    # Display options: show all rows
    pd.set_option('display.max_rows', None)
    print("*" * 50)
    print(f"Inspection of dataframe {dfname}")
    print("*" * 50)
    print("")
    # Display the first rows of the dataframe
    print("Dataframe")
    print("-" * 50)
    display(df.head())
    print("")
    # Dimensions, variable types, non-null values,
    # statistical description of the dataframe (mean, standard deviation, min-max, median, IQR)
    print("Dataframe dimensions, variable types, statistical description, non-null values")
    print("-" * 50)
    skim(df)  # skim() renders its own output; printing its return value would only print None
    print("")
    # Number of unique values per column
    print("Unique values per variable")
    print("-" * 50)
    print(df.nunique())
    print("")
    # Identify and display duplicates
    print("Number of duplicates")
    print("-" * 50)
    print(df.duplicated().sum())
    print("")
    # Sample of the categories of the qualitative variables (if any)
    col = df.select_dtypes(include='object').columns.tolist()
    if len(col) > 0:
        print("")
        print("Sample of the categories of the qualitative variables (max 5 categories)")
        print("-" * 50)
        for c in col:
            print(f'{c} : {df[c].unique()[:5]}\n')
    # Reset the option to return to the default settings
    pd.reset_option('display.max_rows')
Outliers
def detectOutlier(data, threshold):
    """!
    @brief Detects outliers in a data series using the z-score method.
    This function identifies outliers in a data series by comparing
    individual values to the mean and standard deviation. Values whose z-score
    exceeds the given threshold are considered outliers.
    @param data: Data series to scan for outliers (pandas Series or list).
    @param threshold: z-score above which a value is considered an outlier (float, recommended range: [2, 3]).
    @return outliers: List of the outliers in the variable (list).
    """
    outliers = []  # empty list to store the outliers
    mean = np.mean(data)  # mean of the data
    std = np.std(data)  # standard deviation of the data
    if std == 0:  # constant series: check once, before the loop
        print('Standard deviation is zero')
        return outliers
    for i in data:
        z_score = (i - mean) / std  # z-score of value i
        if np.abs(z_score) > threshold:  # absolute value of the z-score
            outliers.append(i)  # add i to the list of outliers
    return outliers
def print_outlier(df, columns, res_var):
    """!
    @brief Identifies the outliers of each indicator and displays them in the output dataframe.
    @param df: Dataframe containing the data to analyse (pandas.DataFrame).
    @param columns: List of the column names to analyse (list).
    @param res_var: List of the variables to display alongside the analysed column in the output dataframe (list).
    """
    for col in columns:
        # Apply the detection function to the column
        subset = df[col]
        outliers = detectOutlier(subset, 2)
        # Variables to display
        if col in res_var:
            var = res_var
        else:
            var = res_var + [col]
        # Extract the matching rows from the dataframe
        res = df.loc[df[col].isin(outliers), var].sort_values(col, ascending=False)
        # Display the detected outliers
        print("")
        display(Markdown(f"- **{col}** : number of outliers detected => " + str(len(outliers))))
        # If there are outliers, display the dataframe res
        if len(outliers) > 0:
            # If there are 10 or more outliers, display an excerpt of res
            if len(outliers) >= 10:
                print("First 5 outliers: ")
                display(res.head(5))
                print("Last 5 outliers: ")
                display(res.tail(5))
            else:
                display(res)
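Neither helper is called in this part of the notebook. As a quick illustration, here is a usage sketch on hypothetical toy data (not part of the original analysis):
# Usage sketch on toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({'order_id': list('abcdefghij'),
                    'price': [10, 12, 11, 9, 10, 11, 13, 10, 12, 120]})
print_outlier(toy, columns=['price'], res_var=['order_id'])  # flags 120 as the only z-score outlier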
Plots
# Distribution of the variables
def plot_kde(df, nb_cols, features):
    """!
    @brief Displays the distributions of several variables as density plots (KDE).
    @param df: Dataframe containing the data (pd.DataFrame).
    @param nb_cols: Number of columns in the grid layout (int).
    @param features: List of the selected variables (list).
    """
    nb_plots = len(features)
    nb_rows = -(-nb_plots // nb_cols)  # ceiling division: enough rows for all plots
    fig, axs = plt.subplots(nb_rows, nb_cols, figsize=(width* nb_cols, height* nb_rows), squeeze=False, tight_layout=True)
    # Loop to create and display the plots
    for i, ind in enumerate(features):
        row = i // nb_cols
        col = i % nb_cols
        ax = axs[row, col]
        sns.kdeplot(df[ind], ax=ax, fill=True)
        ax.set_title("Distribution of "+ind, wrap=True)
        ax.set_xlabel(ind)
        ax.set_ylabel('Density')
    # Hide the empty subplots (if any)
    for i in range(nb_plots, nb_rows * nb_cols):
        axs[i // nb_cols, i % nb_cols].axis('off')
    plt.show()
Data import and inspection
# Connect to the SQLite database
conn = sqlite3.connect('olist.db')
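As a quick sanity check (a sketch, not in the original notebook), the tables available in the SQLite file can be listed through SQLite's sqlite_master catalog:
# List the tables present in the SQLite database
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;", conn)
print(tables['name'].tolist())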
Table "customers"¶
# Inspection du data frame
inspection(pd.read_sql_query("""SELECT * FROM customers;""", conn), "Table customers")
**************************************************
Inspection of dataframe Table customers
**************************************************

Dataframe
--------------------------------------------------
| | index | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|---|
| 0 | 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 861eff4711a542e4b93843c6dd7febb0 | 14409 | franca | SP |
| 1 | 1 | 18955e83d337fd6b2def6b18a428ac77 | 290c77bc529b7ac935b93aa66c333dc3 | 9790 | sao bernardo do campo | SP |
| 2 | 2 | 4e7b3e00288586ebd08712fdd0374a03 | 060e732b5b29e8181a18229c7b0b2b5e | 1151 | sao paulo | SP |
| 3 | 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 259dac757896d24d7702b9acbbff3f3c | 8775 | mogi das cruzes | SP |
| 4 | 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 345ecd01c38d18a9036ed96c73b8d066 | 13056 | campinas | SP |
Dataframe dimensions, variable types, statistical description, non-null values
--------------------------------------------------
skimpy summary: 99441 rows x 6 columns (4 string, 2 int32), no missing values.
number: index (mean 50000, sd 29000, range 0-99000); customer_zip_code_prefix (mean 35000, sd 30000, range 1000-100000)
string: customer_id, customer_unique_id, customer_city, customer_state (0% NA)

Unique values per variable
--------------------------------------------------
index 99441, customer_id 99441, customer_unique_id 96096, customer_zip_code_prefix 14994, customer_city 4119, customer_state 27

Number of duplicates
--------------------------------------------------
0

Sample of the categories of the qualitative variables (max 5 categories)
--------------------------------------------------
customer_id : ['06b8999e2fba1a1fbc88172c00ba8bc7' '18955e83d337fd6b2def6b18a428ac77' '4e7b3e00288586ebd08712fdd0374a03' 'b2b6027bc5c5109e529d4dc6358b12c3' '4f2d8ab171c80ec8364f7c12e35b23ad']
customer_unique_id : ['861eff4711a542e4b93843c6dd7febb0' '290c77bc529b7ac935b93aa66c333dc3' '060e732b5b29e8181a18229c7b0b2b5e' '259dac757896d24d7702b9acbbff3f3c' '345ecd01c38d18a9036ed96c73b8d066']
customer_city : ['franca' 'sao bernardo do campo' 'sao paulo' 'mogi das cruzes' 'campinas']
customer_state : ['SP' 'SC' 'MG' 'PR' 'RJ']
Table "orders"¶
# Inspection du data frame
inspection(pd.read_sql_query("""SELECT * FROM orders;""", conn), "Table orders")
**************************************************
Inspection of dataframe Table orders
**************************************************

Dataframe
--------------------------------------------------
| | index | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 |
| 1 | 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 |
| 2 | 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 |
| 3 | 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 |
| 4 | 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 |
Dataframe dimensions, variable types, statistical description, non-null values
--------------------------------------------------
skimpy summary: 99441 rows x 9 columns (8 string, 1 int32).
number: index (mean 50000, sd 29000, range 0-99000)
string NAs: order_approved_at 160 (0.16%), order_delivered_carrier_date 1783 (1.79%), order_delivered_customer_date 2965 (2.98%); all other columns 0% NA

Unique values per variable
--------------------------------------------------
index 99441, order_id 99441, customer_id 99441, order_status 8, order_purchase_timestamp 98875, order_approved_at 90733, order_delivered_carrier_date 81018, order_delivered_customer_date 95664, order_estimated_delivery_date 459

Number of duplicates
--------------------------------------------------
0

Sample of the categories of the qualitative variables (max 5 categories)
--------------------------------------------------
order_id : ['e481f51cbdc54678b7cc49136f2d6af7' '53cdb2fc8bc7dce0b6741e2150273451' '47770eb9100c2d0c44946d9cf07ec65d' '949d5b44dbf5de918fe9c16f97b45f8a' 'ad21c59c0840e6cb83a9ceb5573f8159']
customer_id : ['9ef432eb6251297304e76186b10a928d' 'b0830fb4747a6c6d20dea0b8c802d7ef' '41ce2a54c0b03bf3443c3d931a367089' 'f88197465ea7920adcdbec7375364d82' '8ab97904e6daea8866dbdbc4fb7aad2c']
order_status : ['delivered' 'invoiced' 'shipped' 'processing' 'unavailable']
order_purchase_timestamp : ['2017-10-02 10:56:33' '2018-07-24 20:41:37' '2018-08-08 08:38:49' '2017-11-18 19:28:06' '2018-02-13 21:18:39']
order_approved_at : ['2017-10-02 11:07:15' '2018-07-26 03:24:27' '2018-08-08 08:55:23' '2017-11-18 19:45:59' '2018-02-13 22:20:29']
order_delivered_carrier_date : ['2017-10-04 19:55:00' '2018-07-26 14:31:00' '2018-08-08 13:50:00' '2017-11-22 13:39:59' '2018-02-14 19:46:34']
order_delivered_customer_date : ['2017-10-10 21:25:13' '2018-08-07 15:27:45' '2018-08-17 18:06:29' '2017-12-02 00:28:42' '2018-02-16 18:17:02']
order_estimated_delivery_date : ['2017-10-18 00:00:00' '2018-08-13 00:00:00' '2018-09-04 00:00:00' '2017-12-15 00:00:00' '2018-02-26 00:00:00']
Observations: there are as many customer_id values as order_id values, whereas we would expect fewer customers than orders (customer_id < order_id). See the cross-check sketched below.
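The analysis of the customers table later on resolves this: customer_id is a per-order key, while customer_unique_id identifies the person. A quick cross-check (a sketch, not run in the original) would be:
# Compare the number of orders, per-order customer keys, and actual customers
check = pd.read_sql_query("""
SELECT COUNT(DISTINCT o.order_id) AS nb_orders,
       COUNT(DISTINCT o.customer_id) AS nb_customer_id,
       COUNT(DISTINCT c.customer_unique_id) AS nb_unique_customers
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;
""", conn)
display(check)  # expected 99441, 99441, 96096 given the counts above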
Table "order_items"¶
# Inspection du data frame
inspection(pd.read_sql_query("""SELECT * FROM order_items;""", conn), "Table order_items")
**************************************************
Inspection of dataframe Table order_items
**************************************************

Dataframe
--------------------------------------------------
| | index | order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 00010242fe8c5a6d1ba2dd792cb16214 | 1 | 4244733e06e7ecb4970a6e2683c13e61 | 48436dade18ac8b2bce089ec2a041202 | 2017-09-19 09:45:35 | 58.90 | 13.29 |
| 1 | 1 | 00018f77f2f0320c557190d7a144bdd3 | 1 | e5f2d52b802189ee658865ca93d83a8f | dd7ddc04e1b6c2c614352b383efe2d36 | 2017-05-03 11:05:13 | 239.90 | 19.93 |
| 2 | 2 | 000229ec398224ef6ca0657da4fc703e | 1 | c777355d18b72b67abbeef9df44fd0fd | 5b51032eddd242adc84c38acab88f23d | 2018-01-18 14:48:30 | 199.00 | 17.87 |
| 3 | 3 | 00024acbcdf0a6daa1e931b038114c75 | 1 | 7634da152a4610f1595efa32f14722fc | 9d7a1d34a5052409006425275ba1c2b4 | 2018-08-15 10:10:18 | 12.99 | 12.79 |
| 4 | 4 | 00042b26cf59d7ce69dfabb4e55b4fd9 | 1 | ac6c3623068f30de03045865e4e10089 | df560393f3a51e74553ab94004ba5c87 | 2017-02-13 13:57:51 | 199.90 | 18.14 |
Dataframe dimensions, variable types, statistical description, non-null values
--------------------------------------------------
skimpy summary: 112650 rows x 8 columns (4 string, 2 int32, 2 float64), no missing values.
number: index; order_item_id (mean 1.2, sd 0.71, range 1-21); price (mean 120, sd 180, p25 40, p50 75, p75 130, range 0.85-6700); freight_value (mean 20, sd 16, range 0-410)

Unique values per variable
--------------------------------------------------
index 112650, order_id 98666, order_item_id 21, product_id 32951, seller_id 3095, shipping_limit_date 93318, price 5968, freight_value 6999

Number of duplicates
--------------------------------------------------
0

Sample of the categories of the qualitative variables (max 5 categories)
--------------------------------------------------
order_id : ['00010242fe8c5a6d1ba2dd792cb16214' '00018f77f2f0320c557190d7a144bdd3' '000229ec398224ef6ca0657da4fc703e' '00024acbcdf0a6daa1e931b038114c75' '00042b26cf59d7ce69dfabb4e55b4fd9']
product_id : ['4244733e06e7ecb4970a6e2683c13e61' 'e5f2d52b802189ee658865ca93d83a8f' 'c777355d18b72b67abbeef9df44fd0fd' '7634da152a4610f1595efa32f14722fc' 'ac6c3623068f30de03045865e4e10089']
seller_id : ['48436dade18ac8b2bce089ec2a041202' 'dd7ddc04e1b6c2c614352b383efe2d36' '5b51032eddd242adc84c38acab88f23d' '9d7a1d34a5052409006425275ba1c2b4' 'df560393f3a51e74553ab94004ba5c87']
shipping_limit_date : ['2017-09-19 09:45:35' '2017-05-03 11:05:13' '2018-01-18 14:48:30' '2018-08-15 10:10:18' '2017-02-13 13:57:51']
Table "order_pymts"¶
# Inspection du data frame
inspection(pd.read_sql_query("""SELECT * FROM order_pymts;""", conn), "Table order_pymts")
**************************************************
Inspection of dataframe Table order_pymts
**************************************************

Dataframe
--------------------------------------------------
| | index | order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|---|---|---|---|---|---|
| 0 | 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
| 3 | 3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
| 4 | 4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
Dataframe dimensions, variable types, statistical description, non-null values
--------------------------------------------------
skimpy summary: 103886 rows x 6 columns (3 int32, 2 string, 1 float64), no missing values.
number: index; payment_sequential (mean 1.1, sd 0.71, range 1-29); payment_installments (mean 2.9, sd 2.7, range 0-24); payment_value (mean 150, sd 220, p25 57, p50 100, p75 170, range 0-14000)

Unique values per variable
--------------------------------------------------
index 103886, order_id 99440, payment_sequential 29, payment_type 5, payment_installments 24, payment_value 29077

Number of duplicates
--------------------------------------------------
0

Sample of the categories of the qualitative variables (max 5 categories)
--------------------------------------------------
order_id : ['b81ef226f3fe1789b1e8b2acac839d17' 'a9810da82917af2d9aefd1278f1dcfa0' '25e8ea4e93396b6fa0d3dd708e76c1bd' 'ba78997921bbcdc1373bb41e913ab953' '42fdf880ba16b47b59251dd489d4441a']
payment_type : ['credit_card' 'boleto' 'voucher' 'debit_card' 'not_defined']
Observations: there are more payment rows than order_id values; are there several payments for the same order_id?
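One way to confirm this (a sketch, not run in the original) is to count the orders with more than one payment row; payment_sequential going up to 29 already suggests split payments (for example a card plus vouchers):
# Count the orders with more than one payment row
multi_pay = pd.read_sql_query("""
SELECT COUNT(*) AS nb_orders_with_multiple_payments
FROM (
    SELECT order_id
    FROM order_pymts
    GROUP BY order_id
    HAVING COUNT(*) > 1
);
""", conn)
display(multi_pay)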
Table "order_reviews"¶
# Inspection du data frame
inspection(pd.read_sql_query("""SELECT * FROM order_reviews;""", conn), "Table order_reviews")
**************************************************
Inspection of dataframe Table order_reviews
**************************************************

Dataframe
--------------------------------------------------
| | index | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | None | None | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | None | None | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | None | None | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | None | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | None | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |
Dataframe dimensions, variable types, statistical description, non-null values
--------------------------------------------------
skimpy summary: 99224 rows x 8 columns (6 string, 2 int32).
number: index; review_score (mean 4.1, sd 1.3, p25 4, p50 5, p75 5, range 1-5)
string NAs: review_comment_title 87656 (88.34%), review_comment_message 58247 (58.7%); all other columns 0% NA

Unique values per variable
--------------------------------------------------
index 99224, review_id 98410, order_id 98673, review_score 5, review_comment_title 4527, review_comment_message 36159, review_creation_date 636, review_answer_timestamp 98248

Number of duplicates
--------------------------------------------------
0

Sample of the categories of the qualitative variables (max 5 categories)
--------------------------------------------------
review_id : ['7bc2406110b926393aa56f80a40eba40' '80e641a11e56f04c1ad469d5645fdfde' '228ce5500dc1d8e020d8d1322874b6f0' 'e64fb393e7b32834bb789ff8bb30750e' 'f7c4243c7fe1938f181bec41a392bdeb']
order_id : ['73fc7af87114b39712e6da79b0a377eb' 'a548910a1c6147796b98fdf73dbeba33' 'f9e4b658b201a9f2ecdecbb34bed034b' '658677c97b385a9be170737859d3511b' '8e6bfb81e283fa7e4f11123a3fb894f1']
review_comment_title : [None 'recomendo' 'Super recomendo' 'Não chegou meu produto ' 'Ótimo']
review_comment_message : [None 'Recebi bem antes do prazo estipulado.' 'Parabéns lojas lannister adorei comprar pela Internet seguro e prático Parabéns a todos feliz Páscoa' 'aparelho eficiente. no site a marca do aparelho esta impresso como 3desinfector e ao chegar esta com outro nome...atualizar com a marca correta uma vez que é o mesmo aparelho' 'Mas um pouco ,travando...pelo valor ta Boa.\r\n']
review_creation_date : ['2018-01-18 00:00:00' '2018-03-10 00:00:00' '2018-02-17 00:00:00' '2017-04-21 00:00:00' '2018-03-01 00:00:00']
review_answer_timestamp : ['2018-01-18 21:46:59' '2018-03-11 03:05:13' '2018-02-18 14:36:24' '2017-04-21 22:02:06' '2018-03-02 10:26:53']
Observations: there are fewer reviews than order_id values.
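A direct way to quantify this (a sketch, not run in the original) is to count the orders with no review at all:
# Count the orders that have no matching review
no_review = pd.read_sql_query("""
SELECT COUNT(*) AS nb_orders_without_review
FROM orders o
LEFT JOIN order_reviews r ON o.order_id = r.order_id
WHERE r.order_id IS NULL;
""", conn)
display(no_review)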
Table "products"¶
# Inspection du data frame
inspection(pd.read_sql_query("""SELECT * FROM products;""", conn), "Table products")
**************************************************
Inspection of dataframe Table products
**************************************************

Dataframe
--------------------------------------------------
| | index | product_id | product_category_name | product_name_lenght | product_description_lenght | product_photos_qty | product_weight_g | product_length_cm | product_height_cm | product_width_cm |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1e9e8ef04dbcff4541ed26657ea517e5 | perfumaria | 40.0 | 287.0 | 1.0 | 225.0 | 16.0 | 10.0 | 14.0 |
| 1 | 1 | 3aa071139cb16b67ca9e5dea641aaa2f | artes | 44.0 | 276.0 | 1.0 | 1000.0 | 30.0 | 18.0 | 20.0 |
| 2 | 2 | 96bd76ec8810374ed1b65e291975717f | esporte_lazer | 46.0 | 250.0 | 1.0 | 154.0 | 18.0 | 9.0 | 15.0 |
| 3 | 3 | cef67bcfe19066a932b7673e239eb23d | bebes | 27.0 | 261.0 | 1.0 | 371.0 | 26.0 | 4.0 | 26.0 |
| 4 | 4 | 9dc1a7de274444849c219cff195d0b71 | utilidades_domesticas | 37.0 | 402.0 | 4.0 | 625.0 | 20.0 | 17.0 | 13.0 |
Dataframe dimensions, variable types, statistical description, non-null values
--------------------------------------------------
skimpy summary: 32951 rows x 10 columns (7 float64, 2 string, 1 int32).
number: product_name_lenght (mean 48, range 5-76), product_description_lenght (mean 770, range 4-4000), product_photos_qty (mean 2.2, range 1-20), each with 610 NA (1.85%); product_weight_g (mean 2300, range 0-40000), product_length_cm (mean 31, range 7-100), product_height_cm (mean 17, range 2-100), product_width_cm (mean 23, range 6-120), each with 2 NA (0.01%)
string NAs: product_category_name 610 (1.85%)

Unique values per variable
--------------------------------------------------
index 32951, product_id 32951, product_category_name 73, product_name_lenght 66, product_description_lenght 2960, product_photos_qty 19, product_weight_g 2204, product_length_cm 99, product_height_cm 102, product_width_cm 95

Number of duplicates
--------------------------------------------------
0

Sample of the categories of the qualitative variables (max 5 categories)
--------------------------------------------------
product_id : ['1e9e8ef04dbcff4541ed26657ea517e5' '3aa071139cb16b67ca9e5dea641aaa2f' '96bd76ec8810374ed1b65e291975717f' 'cef67bcfe19066a932b7673e239eb23d' '9dc1a7de274444849c219cff195d0b71']
product_category_name : ['perfumaria' 'artes' 'esporte_lazer' 'bebes' 'utilidades_domesticas']
Table "translation"¶
# Inspection du data frame
inspection(pd.read_sql_query("""SELECT * FROM translation;""", conn), "Table translation")
**************************************************
Inspection of dataframe Table translation
**************************************************

Dataframe
--------------------------------------------------
| | index | product_category_name | product_category_name_english |
|---|---|---|---|
| 0 | 0 | beleza_saude | health_beauty |
| 1 | 1 | informatica_acessorios | computers_accessories |
| 2 | 2 | automotivo | auto |
| 3 | 3 | cama_mesa_banho | bed_bath_table |
| 4 | 4 | moveis_decoracao | furniture_decor |
Dataframe dimensions, variable types, statistical description, non-null values
--------------------------------------------------
skimpy summary: 71 rows x 3 columns (2 string, 1 int32), no missing values.

Unique values per variable
--------------------------------------------------
index 71, product_category_name 71, product_category_name_english 71

Number of duplicates
--------------------------------------------------
0

Sample of the categories of the qualitative variables (max 5 categories)
--------------------------------------------------
product_category_name : ['beleza_saude' 'informatica_acessorios' 'automotivo' 'cama_mesa_banho' 'moveis_decoracao']
product_category_name_english : ['health_beauty' 'computers_accessories' 'auto' 'bed_bath_table' 'furniture_decor']
Table "sellers"¶
# Inspection du data frame
inspection(pd.read_sql_query("""SELECT * FROM sellers;""", conn), "Table sellers")
**************************************************
Inspection of dataframe Table sellers
**************************************************

Dataframe
--------------------------------------------------
| | index | seller_id | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|---|
| 0 | 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | campinas | SP |
| 1 | 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | mogi guacu | SP |
| 2 | 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | rio de janeiro | RJ |
| 3 | 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | sao paulo | SP |
| 4 | 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | braganca paulista | SP |
Dataframe dimensions, variable types, statistical description, non-null values
--------------------------------------------------
skimpy summary: 3095 rows x 5 columns (3 string, 2 int32), no missing values.
number: index; seller_zip_code_prefix (mean 32000, sd 33000, range 1000-100000)

Unique values per variable
--------------------------------------------------
index 3095, seller_id 3095, seller_zip_code_prefix 2246, seller_city 611, seller_state 23

Number of duplicates
--------------------------------------------------
0

Sample of the categories of the qualitative variables (max 5 categories)
--------------------------------------------------
seller_id : ['3442f8959a84dea7ee197c632cb2df15' 'd1b65fc7debc3361ea86b5f14c68d2e2' 'ce3ad9de960102d0677a81f5d0bb7b2d' 'c0f3eea2e14555b6faeea3dd58c1b1c3' '51a04a8a6bdcb23deccc82b0b80742cf']
seller_city : ['campinas' 'mogi guacu' 'rio de janeiro' 'sao paulo' 'braganca paulista']
seller_state : ['SP' 'RJ' 'PE' 'PR' 'GO']
Table "geoloc"¶
# Inspection du data frame
inspection(pd.read_sql_query("""SELECT * FROM geoloc;""", conn), "Table geoloc")
**************************************************
Inspection of dataframe Table geoloc
**************************************************

Dataframe
--------------------------------------------------
| | index | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state |
|---|---|---|---|---|---|---|
| 0 | 0 | 1037 | -23.545621 | -46.639292 | sao paulo | SP |
| 1 | 1 | 1046 | -23.546081 | -46.644820 | sao paulo | SP |
| 2 | 2 | 1046 | -23.546129 | -46.642951 | sao paulo | SP |
| 3 | 3 | 1041 | -23.544392 | -46.639499 | sao paulo | SP |
| 4 | 4 | 1035 | -23.541578 | -46.641607 | sao paulo | SP |
Dataframe dimensions, variable types, statistical description, non-null values
--------------------------------------------------
skimpy summary: 1000163 rows x 6 columns (2 int32, 2 float64, 2 string), no missing values.
number: index; geolocation_zip_code_prefix (mean 37000, range 1000-100000); geolocation_lat (mean -21, sd 5.7, range -37 to 45); geolocation_lng (mean -46, sd 4.3, range -100 to 120)

Unique values per variable
--------------------------------------------------
index 1000163, geolocation_zip_code_prefix 19015, geolocation_lat 717360, geolocation_lng 717613, geolocation_city 8011, geolocation_state 27

Number of duplicates
--------------------------------------------------
0

Sample of the categories of the qualitative variables (max 5 categories)
--------------------------------------------------
geolocation_city : ['sao paulo' 'são paulo' 'sao bernardo do campo' 'jundiaí' 'taboão da serra']
geolocation_state : ['SP' 'RN' 'AC' 'RJ' 'ES']
Observations: some latitude/longitude values look aberrant (they fall far outside Brazil), and there are many duplicate rows for the same zip code prefixes.
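Before using these coordinates (with folium, for example), they could be restricted to a rough bounding box around Brazil; the bounds below are approximate and are an assumption, not part of the original notebook:
# Keep only coordinates inside an approximate bounding box for Brazil (assumed bounds)
geo_clean = pd.read_sql_query("""
SELECT *
FROM geoloc
WHERE geolocation_lat BETWEEN -34 AND 6
  AND geolocation_lng BETWEEN -74 AND -34;
""", conn)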
Part 2 - Exploratory analysis of the data tables
The exploratory analysis is done with SQL queries, so the full tables never have to be loaded into memory. At the end, the different variables are aggregated per customer.
Table "customers"
- Primary key and duplicates
# Duplicate customer_id values
nb_doublons = pd.read_sql_query("""
SELECT COUNT(*) AS nombre_doublons
FROM (
SELECT customer_id, COUNT(*) AS occurrence_count
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1
) AS doublons;
""", conn)
display(Markdown(f"Il y a {nb_doublons.iloc[0,0]} doublon(s) parmi les`customer_id`."))
Il y a 0 doublon(s) parmi lescustomer_id.
# Number of `customer_id` customers
df_temp = pd.read_sql_query("""
SELECT COUNT(*) as nb_customers
FROM customers;
""", conn)
nb_customers_id = df_temp.nb_customers[0]
display(Markdown(f"Olist has {nb_customers_id} `customer_id` values referenced in its database"))
Olist has 99441 customer_id values referenced in its database
# Duplicate customer_unique_id values
df_temp = pd.read_sql_query("""
SELECT customer_unique_id, COUNT(*) AS occurrence_count
FROM customers
GROUP BY customer_unique_id
HAVING COUNT(*) > 1;
""", conn)
display(Markdown(f"There are {df_temp.shape[0]} duplicate(s) among the `customer_unique_id` values, "\
                 +f"i.e. {round(df_temp.shape[0]/nb_customers_id*100,2)}% of the dataset."))
# Example of duplicates
display(Markdown("The same customer `customer_unique_id` can have several `customer_id` values: "\
                 +f"here between {min(df_temp['occurrence_count'])} and {max(df_temp['occurrence_count'])}."))
sample = df_temp['customer_unique_id'][0]
display(pd.read_sql_query("""
SELECT * FROM customers
WHERE customer_unique_id = '{}';
""".format(sample), conn))
# Boxplot of the number of duplicates of `customer_unique_id`
plt.figure(figsize=(7,3))
plt.title('Distribution of the number of duplicates of `customer_unique_id`')
sns.boxplot(data=df_temp, x='occurrence_count', showfliers=True, showmeans=True, meanprops=meanprops)
plt.show()
There are 2997 duplicate(s) among the customer_unique_id values, i.e. 3.01% of the dataset.
The same customer customer_unique_id can have several customer_id values: here between 2 and 17.
| | index | customer_id | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|---|
| 0 | 19299 | 1afe8a9c67eec3516c09a8bdcc539090 | 00172711b30d52eea8b313a7f2cced02 | 45200 | jequie | BA |
| 1 | 35608 | 24b0e2bd287e47d54d193e7bbb51103f | 00172711b30d52eea8b313a7f2cced02 | 45200 | jequie | BA |
# Number of `customer_unique_id` customers
df_temp = pd.read_sql_query("""
SELECT customer_unique_id, COUNT(customer_unique_id) AS occurrence_count
FROM customers
GROUP BY customer_unique_id;
""", conn)
display(Markdown(f"Olist has {df_temp.shape[0]} `customer_unique_id` values referenced in its database"))
Olist has 96096 customer_unique_id values referenced in its database
- Number of customers per state
# Number of customers per state
df_temp = pd.read_sql_query("""
SELECT customer_state, COUNT(customer_unique_id) AS nb_customers
FROM customers
GROUP BY customer_state
ORDER BY nb_customers DESC;
""", conn)
# Bar plot of the result
plt.figure(figsize=(15,7))
sns.barplot(data = df_temp, x='customer_state', y='nb_customers', palette='viridis', hue='nb_customers')
plt.title('Number of customers per state')
plt.show()
# Percentage of customers in each state
df_temp['%_customer_per_state'] = round(df_temp['nb_customers']/df_temp['nb_customers'].sum()*100,1)
display(df_temp.sort_values('%_customer_per_state', ascending=False))
| | customer_state | nb_customers | %_customer_per_state |
|---|---|---|---|
| 0 | SP | 41746 | 42.0 |
| 1 | RJ | 12852 | 12.9 |
| 2 | MG | 11635 | 11.7 |
| 3 | RS | 5466 | 5.5 |
| 4 | PR | 5045 | 5.1 |
| 5 | SC | 3637 | 3.7 |
| 6 | BA | 3380 | 3.4 |
| 7 | DF | 2140 | 2.2 |
| 8 | ES | 2033 | 2.0 |
| 9 | GO | 2020 | 2.0 |
| 10 | PE | 1652 | 1.7 |
| 11 | CE | 1336 | 1.3 |
| 12 | PA | 975 | 1.0 |
| 13 | MT | 907 | 0.9 |
| 14 | MA | 747 | 0.8 |
| 15 | MS | 715 | 0.7 |
| 16 | PB | 536 | 0.5 |
| 17 | PI | 495 | 0.5 |
| 18 | RN | 485 | 0.5 |
| 19 | AL | 413 | 0.4 |
| 20 | SE | 350 | 0.4 |
| 21 | TO | 280 | 0.3 |
| 22 | RO | 253 | 0.3 |
| 23 | AM | 148 | 0.1 |
| 24 | AC | 81 | 0.1 |
| 25 | AP | 68 | 0.1 |
| 26 | RR | 46 | 0.0 |
# Number of cities per state
df_temp = pd.read_sql_query("""
SELECT customer_state, COUNT(DISTINCT customer_city) AS nb_cities
FROM customers
GROUP BY customer_state
ORDER BY nb_cities DESC;
""", conn)
# Bar plot of the result
plt.figure(figsize=(15,7))
sns.barplot(data = df_temp, x='customer_state', y='nb_cities', palette='viridis', hue='nb_cities')
plt.title('Number of cities with customers, per state')
plt.show()
# Customers with several addresses
df_temp = pd.read_sql_query("""
SELECT
customer_unique_id,
customer_zip_code_prefix,
customer_city,
customer_state
FROM customers
GROUP BY
customer_unique_id,
customer_zip_code_prefix,
customer_city,
customer_state
;
""", conn)
df_temp[df_temp['customer_unique_id'].duplicated(keep=False)]
| | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|
| 124 | 004b45ec5c64187465168251cd1c9c2f | 57035 | maceio | AL |
| 125 | 004b45ec5c64187465168251cd1c9c2f | 57055 | maceio | AL |
| 145 | 0058f300f57d7b93c477a131a59b36c3 | 40731 | salvador | BA |
| 146 | 0058f300f57d7b93c477a131a59b36c3 | 41370 | salvador | BA |
| 440 | 012452d40dafae4df401bced74cdb490 | 3220 | sao paulo | SP |
| ... | ... | ... | ... | ... |
| 95215 | fd09c64a101e3eff4adbca1b28552514 | 8542 | ferraz de vasconcelos | SP |
| 95676 | fe3e52de024b82706717c38c8e183084 | 36420 | ouro branco | MG |
| 95677 | fe3e52de024b82706717c38c8e183084 | 72306 | brasilia | DF |
| 95718 | fe59d5878cd80080edbd29b5a0a4e1cf | 71065 | brasilia | DF |
| 95719 | fe59d5878cd80080edbd29b5a0a4e1cf | 71065 | guara | DF |
508 rows × 4 columns
For the same customer, there can be different zip codes, cities, or states. Possible causes of these differences:
- the location is the delivery address
- the location is the place the order was placed from
- the customer moved
For the customer's address, the location of the most recent order will be kept, as sketched below.
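One way to implement this rule (a sketch, assuming a SQLite version with window function support, i.e. 3.25+) is to rank each customer's rows by order date and keep the most recent one:
# Keep, for each customer_unique_id, the address attached to the most recent order
last_address = pd.read_sql_query("""
SELECT customer_unique_id, customer_zip_code_prefix, customer_city, customer_state
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (
               PARTITION BY c.customer_unique_id
               ORDER BY o.order_purchase_timestamp DESC
           ) AS rn
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
)
WHERE rn = 1;
""", conn)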
Table "orders"¶
- Doublons
order_id
# Doublons order_id
df_temp = pd.read_sql_query("""
SELECT order_id, COUNT(*) AS occurrence_count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
""", conn)
display(Markdown(f"Il y a {df_temp.shape[0]} doublon(s) parmi les`order_id`."))
Il y a 0 doublon(s) parmi lesorder_id.
- Sales period
# Sales period
display(pd.read_sql_query("""
SELECT DATE(MIN(order_purchase_timestamp)) AS first_order_date, DATE(MAX(order_purchase_timestamp)) AS last_order_date
FROM orders;
""", conn))
| | first_order_date | last_order_date |
|---|---|---|
| 0 | 2016-09-04 | 2018-10-17 |
- Number of orders per order_status
# Number of orders per status
df_temp = pd.read_sql_query("""
SELECT order_status,
COUNT(order_purchase_timestamp) AS nb_orders,
ROUND(CAST(COUNT(order_purchase_timestamp) AS FLOAT) / (SELECT COUNT(*) FROM orders)*100,1) AS percent_of_orders
FROM orders
GROUP BY order_status
ORDER BY nb_orders DESC;
""", conn)
display(Markdown("Pour la suite de l'étude, nous filtrerons les commandes sur le status **'delivered'**"))
display(df_temp)
For the rest of the study, we will keep only orders with status 'delivered'
| | order_status | nb_orders | percent_of_orders |
|---|---|---|---|
| 0 | delivered | 96478 | 97.0 |
| 1 | shipped | 1107 | 1.1 |
| 2 | canceled | 625 | 0.6 |
| 3 | unavailable | 609 | 0.6 |
| 4 | invoiced | 314 | 0.3 |
| 5 | processing | 301 | 0.3 |
| 6 | created | 5 | 0.0 |
| 7 | approved | 2 | 0.0 |
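The filter itself is a simple WHERE clause; as a sketch (not run in the original), counting the kept population:
# Count the delivered orders, i.e. the population kept for the rest of the study
nb_delivered = pd.read_sql_query("""
SELECT COUNT(*) AS nb_delivered
FROM orders
WHERE order_status = 'delivered';
""", conn)
display(nb_delivered)  # expected: 96478, per the table above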
- Number of sales per day
# Number of sales per day
nb_orders_df = pd.read_sql_query("""
SELECT DATE(order_purchase_timestamp) AS order_date, COUNT(order_id) AS nb_orders
FROM orders
GROUP BY order_date
ORDER BY order_date ASC;
""", conn)
# Convert 'order_date' to datetime
nb_orders_df['order_date'] = pd.to_datetime(nb_orders_df['order_date'])
# Time series: number of sales per day
plt.figure(figsize=(20, 7))
plt.title('Number of sales per day', fontsize=18)
sns.lineplot(data=nb_orders_df, x='order_date', y='nb_orders')
plt.xticks(rotation=45)
plt.show()
# Distribution of the number of sales per day
plt.figure(figsize=(20, 7))
plt.title('Distribution of the number of sales per day', fontsize=18)
sns.boxplot(data=nb_orders_df, x='nb_orders', showfliers=True, showmeans=True, meanprops=meanprops)
plt.show()
# Atypical values
print('')
display(Markdown('**Very low sales before January 2017**'))
# Time series: number of sales per day
plt.figure(figsize=(20, 7))
plt.title('Number of sales per day before January 2017', fontsize=18)
sns.lineplot(data=nb_orders_df[nb_orders_df['order_date']<'2017-01-15'], x='order_date', y='nb_orders')
plt.xticks(rotation=45)
plt.show()
display(Markdown('**Peak in November 2017**: the dates match Black Friday'))
display(nb_orders_df[nb_orders_df['nb_orders']>400])
print('')
display(Markdown('**Drop in sales after 2018-08-15**'))
# Time series: number of sales per day
plt.figure(figsize=(20, 7))
plt.title('Number of sales per day after mid-August 2018', fontsize=18)
sns.lineplot(data=nb_orders_df[nb_orders_df['order_date']>'2018-08-15'], x='order_date', y='nb_orders')
plt.xticks(rotation=45)
plt.show()
Very low sales before January 2017
Peak in November 2017: the dates match Black Friday
| | order_date | nb_orders |
|---|---|---|
| 338 | 2017-11-24 | 1176 |
| 339 | 2017-11-25 | 499 |
| 341 | 2017-11-27 | 403 |
Drop in sales after 2018-08-15
- November 2017 sales peak - Black Friday
# Top-selling product categories during Black Friday
df_temp = pd.read_sql_query("""
SELECT t.product_category_name_english, COUNT(oi.product_id) AS nb_product
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
LEFT JOIN products p ON oi.product_id = p.product_id
LEFT JOIN translation t ON p.product_category_name = t.product_category_name
WHERE
o.order_purchase_timestamp >= DATE('2017-11-24')
AND o.order_purchase_timestamp < DATE('2017-11-28') -- strict upper bound so that all of 2017-11-27 is included
GROUP BY t.product_category_name_english
ORDER BY nb_product DESC
LIMIT 15;
""", conn)
plt.figure(figsize=(5,5))
sns.barplot(data=df_temp, x='nb_product', y='product_category_name_english', orient='h')
plt.show()
- Number of sales per month
# Number of sales per month
nb_orders_df['order_year_month'] = nb_orders_df['order_date'].dt.strftime('%Y-%m')
nb_orders_month_df = nb_orders_df.groupby('order_year_month', as_index=False)['nb_orders'].sum()
# Time series: number of sales per month
plt.figure(figsize=(20, 7))
plt.title('Number of sales per month', fontsize=18)
sns.barplot(data=nb_orders_month_df, x='order_year_month', y='nb_orders')
plt.xticks(rotation=45)
plt.show()
# Distribution of the number of sales per month
plt.figure(figsize=(20, 7))
plt.title('Distribution of the number of sales per month', fontsize=18)
sns.boxplot(data=nb_orders_month_df, x='nb_orders', showfliers=True, showmeans=True, meanprops=meanprops)
plt.show()
- Nombre de ventes par an
# Nombre de ventes par an
nb_orders_df['order_year'] = nb_orders_df['order_date'].dt.strftime('%Y')
nb_orders_year_df = nb_orders_df.groupby('order_year').sum('nb_orders').reset_index()
# Time series : Nombre de ventes par an
plt.title('Nombre de ventes par an', fontsize=12)
sns.barplot(data=nb_orders_year_df, x='order_year', y='nb_orders')
plt.show()
- Sales distribution by hour
# Distribution of orders by hour
hour_orders_df = pd.read_sql_query("""
SELECT strftime('%H', o.order_purchase_timestamp) AS order_hour, COUNT(o.order_id) AS nb_orders, c.customer_state
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
GROUP BY order_hour, c.customer_state
ORDER BY order_hour ASC;
""", conn)
# Plot the results
display(Markdown("Orders are placed at all hours of the day, most frequently between 8 a.m. and 11 p.m."))
fig, ax = plt.subplots(1,1,figsize=(20,7))
hour_orders_df.pivot_table(index=['order_hour'], columns='customer_state', values = 'nb_orders', aggfunc='sum')\
.plot(ax=ax, kind='bar',
title="Répartition des commandes par heure et par customer_state",
stacked=True)
ax.legend(bbox_to_anchor=(1, 1, 0, 0))
plt.show()
Orders are placed at all hours of the day, most frequently between 8 a.m. and 11 p.m.
Brazil spans four time zones, which partly explains the spread of purchase hours (see the sketch below).
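As an illustration only, a minimal sketch of how the hours could be re-expressed in local time. The state -> UTC offset map below is a simplified assumption (not taken from the database), and the timestamps are assumed to be stored in Brasília time (UTC-3):
# Hypothetical local-hour adjustment (assumptions: timestamps in UTC-3; simplified state offsets)
state_utc_offset = {'AC': -5, 'AM': -4, 'MT': -4, 'MS': -4, 'RO': -4, 'RR': -4}  # all other states: -3
df_tz = hour_orders_df.copy()
df_tz['utc_offset'] = df_tz['customer_state'].map(state_utc_offset).fillna(-3)
# Shift each stored hour by the difference between the local offset and UTC-3
df_tz['local_hour'] = (df_tz['order_hour'].astype(int) + (df_tz['utc_offset'] + 3)) % 24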
- Delivery time
df_temp = pd.read_sql_query("""
SELECT JULIANDAY(o.order_delivered_customer_date) - JULIANDAY(o.order_purchase_timestamp) AS delivery_delay
FROM orders o
""", conn)
# Delivery time
plt.title('Delivery time distribution (days)', fontsize=18)
sns.boxplot(data=df_temp, x='delivery_delay', showfliers=True, showmeans=True, meanprops=meanprops)
plt.show()
display(df_temp.describe())
| | delivery_delay |
|---|---|
| count | 96476.000000 |
| mean | 12.558702 |
| std | 9.546530 |
| min | 0.533414 |
| 25% | 6.766403 |
| 50% | 10.217755 |
| 75% | 15.720327 |
| max | 209.628611 |
- Number of orders (frequency) and date of last purchase (recency) per customer
# Number of orders and date of last purchase per customer
df_temp = pd.read_sql_query("""
WITH LastOrderDate AS (
SELECT MAX(order_purchase_timestamp) AS last_order_date
FROM orders
)
SELECT c.customer_unique_id,
ROUND(JULIANDAY((SELECT last_order_date FROM LastOrderDate)) - JULIANDAY(MAX(o.order_purchase_timestamp))) AS recency,
COUNT(DISTINCT o.order_id) AS frequency
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.customer_unique_id
;
""", conn)
# Distribution of purchase frequency and recency
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(20,5), tight_layout=True)
ax1.set_title("Purchase frequency distribution", fontsize=18)
sns.boxplot(data=df_temp, x='frequency', showfliers=True, showmeans=True, meanprops=meanprops, ax=ax1)
ax2.set_title("Purchase recency distribution", fontsize=18)
sns.boxplot(data=df_temp, x='recency', showfliers=True, showmeans=True, meanprops=meanprops, ax=ax2)
plt.show()
# Number of customers with more than one purchase
nb_customers = df_temp[df_temp['frequency']>1].shape[0]
nb_tot_customers = df_temp.shape[0]
percent = round(nb_customers / nb_tot_customers*100,1)
display(Markdown(f"Only {percent}% of customers have made more than one purchase."))
Only 3.1% of customers have made more than one purchase.
# Lorenz curve
customer = df_temp['frequency'].values
n = len(customer)
lorenz = np.cumsum(np.sort(customer)) / customer.sum()
lorenz = np.append([0],lorenz) # The Lorenz curve starts at 0
xaxis = np.linspace(0-1/n,1+1/n,n+1) # One segment of width 1/n per individual, plus one extra segment at ordinate 0. The first segment starts at 0-1/n, the last ends at 1+1/n.
plt.plot(xaxis,lorenz,drawstyle='steps-post')
plt.plot([0,1], [0,1]) # Plot the bisector
plt.axvline(x=0.97, color='r', linestyle='--')
plt.text(0.90, -0.06, '0.97', color='r', fontsize=12)
plt.axhline(y=0.93, color='r', linestyle='--')
plt.text(-0.08, 0.92, '0.93', color='r', fontsize=12)
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel("% of customers")
plt.ylabel("% of orders")
plt.grid(color='gray', linestyle='-', linewidth=0.5)
plt.title("Number of orders per customer (frequency)",fontsize=11)
plt.show()
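To summarize the curve above in a single figure, a short sketch computing the Gini index from the lorenz array just built (one minus twice the area under the curve, trapezoidal approximation):
# Gini index from the Lorenz curve above: gini = 1 - 2 * area under the curve
auc = np.trapz(lorenz, dx=1/n)
gini = 1 - 2 * auc
print(f"Gini index (frequency): {gini:.3f}")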
- Evolution of the number of customers
# Number of customers placing orders per month
df_temp = pd.read_sql_query("""
SELECT
strftime('%Y-%m', o.order_purchase_timestamp) AS month,
COUNT(DISTINCT c.customer_unique_id) AS nb_customers
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
GROUP BY strftime('%Y-%m', o.order_purchase_timestamp)
""", conn)
# Time series: number of customers placing orders per month
plt.figure(figsize=(20, 7))
plt.title('Number of customers placing orders per month', fontsize=12)
sns.barplot(data=df_temp, x='month', y='nb_customers')
plt.xticks(rotation=45)
plt.show()
- Dataframe with frequency, recency, mean_delivery_delay, zip_code, city, state
# Recency, frequency, mean delivery delay and last known location per customer
customer_df = pd.read_sql_query("""
WITH LastOrderDate AS (
SELECT MAX(order_purchase_timestamp) AS last_order_date
FROM orders
),
LastLocation AS(
SELECT
c.customer_unique_id,
c.customer_zip_code_prefix,
c.customer_city,
c.customer_state,
MAX(o.order_purchase_timestamp)
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE
o.order_purchase_timestamp < DATE('2018-09-01')
AND o.order_purchase_timestamp > DATE('2017-01-01')
AND o.order_status = 'delivered'
GROUP BY c.customer_unique_id
)
SELECT c.customer_unique_id,
ROUND(JULIANDAY((SELECT last_order_date FROM LastOrderDate)) - JULIANDAY(MAX(o.order_purchase_timestamp))) AS recency,
COUNT(DISTINCT o.order_id) AS frequency,
AVG(JULIANDAY(o.order_delivered_customer_date) - JULIANDAY(o.order_estimated_delivery_date)) AS mean_delivery_delay,
l.customer_zip_code_prefix,
l.customer_city,
l.customer_state
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
LEFT JOIN LastLocation l ON c.customer_unique_id = l.customer_unique_id
WHERE
o.order_purchase_timestamp < DATE('2018-09-01')
AND o.order_purchase_timestamp > DATE('2017-01-01')
AND o.order_status = 'delivered'
GROUP BY c.customer_unique_id
HAVING mean_delivery_delay IS NOT NULL
;
""", conn)
display(Markdown(f"Dataframe 'customer_df' de dimensions {customer_df.shape}, features : {list(customer_df.columns)}"))
Dataframe 'customer_df' de dimensions (93096, 7), features : ['customer_unique_id', 'recency', 'frequency', 'mean_delivery_delay', 'customer_zip_code_prefix', 'customer_city', 'customer_state']
Table "order_items"¶
- Duplicates
order_id
# Duplicate order_id
item_per_order_df = pd.read_sql_query("""
SELECT order_id, COUNT(*) AS occurrence_count
FROM order_items
GROUP BY order_id
ORDER BY occurrence_count DESC;
""", conn)
nb_doublons = item_per_order_df[item_per_order_df['occurrence_count']>1].shape[0]
display(Markdown(f"Il y a {nb_doublons} doublon(s) parmi les`order_id`,"\
+f" soit {nb_doublons} commandes avec plus de 1 article."))
# Affichage d'un exemple de doublon
sample = item_per_order_df['order_id'][10]
display(pd.read_sql_query("""
SELECT * FROM order_items
WHERE order_id = '{}';
""".format(sample), conn))
# Distribution of the number of items per order
print('')
min_items = min(item_per_order_df['occurrence_count'])
max_items = max(item_per_order_df['occurrence_count'])
display(Markdown(f"Orders contain between {min_items} and {max_items} items."))
plt.figure(figsize=(15, 5))
plt.title("Distribution of the number of items per order", fontsize=18)
sns.boxplot(data=item_per_order_df, x='occurrence_count', showfliers=True, showmeans=True, meanprops=meanprops)
plt.show()
# Orders with 15 items or more
item_per_order_df[item_per_order_df['occurrence_count']>=15]
There are 9803 duplicated order_id, i.e. 9803 orders with more than one item.
| | index | order_id | order_item_id | product_id | seller_id | shipping_limit_date | price | freight_value |
|---|---|---|---|---|---|---|---|---|
| 0 | 43755 | 637617b3ffe9e2f7a2411243829226d0 | 1 | e11092e9722d1b6f8c18cd8947a1daff | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.01 |
| 1 | 43756 | 637617b3ffe9e2f7a2411243829226d0 | 2 | b6ada73818651fe7e64be10653062ea2 | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.01 |
| 2 | 43757 | 637617b3ffe9e2f7a2411243829226d0 | 3 | b6ada73818651fe7e64be10653062ea2 | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.01 |
| 3 | 43758 | 637617b3ffe9e2f7a2411243829226d0 | 4 | b6ada73818651fe7e64be10653062ea2 | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.01 |
| 4 | 43759 | 637617b3ffe9e2f7a2411243829226d0 | 5 | e11092e9722d1b6f8c18cd8947a1daff | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.01 |
| 5 | 43760 | 637617b3ffe9e2f7a2411243829226d0 | 6 | e11092e9722d1b6f8c18cd8947a1daff | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.01 |
| 6 | 43761 | 637617b3ffe9e2f7a2411243829226d0 | 7 | 1ba4e3fe92f16fd5a8942f7b7d804b52 | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.01 |
| 7 | 43762 | 637617b3ffe9e2f7a2411243829226d0 | 8 | 1ba4e3fe92f16fd5a8942f7b7d804b52 | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.01 |
| 8 | 43763 | 637617b3ffe9e2f7a2411243829226d0 | 9 | 1ba4e3fe92f16fd5a8942f7b7d804b52 | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.01 |
| 9 | 43764 | 637617b3ffe9e2f7a2411243829226d0 | 10 | bf44071ef18f5c9ded039681c68b1996 | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.02 |
| 10 | 43765 | 637617b3ffe9e2f7a2411243829226d0 | 11 | bf44071ef18f5c9ded039681c68b1996 | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.02 |
| 11 | 43766 | 637617b3ffe9e2f7a2411243829226d0 | 12 | bf44071ef18f5c9ded039681c68b1996 | 1f50f920176fa81dab994f9023523100 | 2018-04-25 02:51:20 | 79.9 | 24.02 |
Orders contain between 1 and 21 items.
| | order_id | occurrence_count |
|---|---|---|
| 0 | 8272b63d03f5f79c56e9e4120aec44ef | 21 |
| 1 | ab14fdcfbe524636d65ee38360e22ce8 | 20 |
| 2 | 1b15974a0141d54e36626dca3fdc731a | 20 |
| 3 | 9ef13efd6949e4573a18964dd1bbe7f5 | 15 |
| 4 | 428a2f660dc84138d969ccd69a0ab6d5 | 15 |
- Primary key:
order_id+order_item_id
# Duplicate order_id + order_item_id
df_temp = pd.read_sql_query("""
SELECT order_id, product_id, COUNT(*) AS occurrence_count
FROM order_items
GROUP BY order_id, order_item_id
HAVING COUNT(*) > 1
ORDER BY order_id, order_item_id;
""", conn)
display(Markdown(f"Il y a {df_temp.shape[0]} doublon(s) parmi les`order_id`+`order_item_id`."))
Il y a 0 doublon(s) parmi lesorder_id+order_item_id.
- Taille du panier moyen
# Taille du panier moyen
basket_df = pd.read_sql_query("""
WITH nb_items AS (
SELECT
oi.order_id,
COUNT(oi.order_item_id) AS nb_items
FROM order_items oi
GROUP BY oi.order_id
)
SELECT
c.customer_unique_id,
AVG(i.nb_items) AS average_basket
FROM orders o
LEFT JOIN nb_items i ON o.order_id = i.order_id
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE
o.order_purchase_timestamp < DATE('2018-09-01')
AND o.order_purchase_timestamp > DATE('2017-01-01')
AND o.order_status = 'delivered'
GROUP BY c.customer_unique_id;
""", conn)
display(Markdown(f"Dataframe 'basket_df' de dimensions {basket_df.shape}"))
# Distribution du panier moyen
plt.figure(figsize=(15, 5))
sns.boxplot(data = basket_df, x='average_basket', showfliers=True, showmeans=True, meanprops=meanprops)
plt.title('Distribution du panier moyen')
plt.show()
Dataframe 'basket_df' de dimensions (93104, 2)
- Catégories de produits achetés, quantité et nombre d'achats
# Analyse des types d'achats
purchase_df = pd.read_sql_query("""
WITH nb_items AS (
SELECT
oi.order_id,
COUNT(oi.order_item_id) AS nb_items
FROM order_items oi
GROUP BY oi.order_id
),
product_translations AS (
SELECT
p.product_id,
t.product_category_name_english
FROM products p
LEFT JOIN translation t ON p.product_category_name = t.product_category_name
)
SELECT
COUNT(DISTINCT o.order_id) AS nb_orders,
COALESCE(SUM(i.nb_items), 0) AS total_items,
AVG(i.nb_items) AS average_basket,
COUNT(DISTINCT oi.product_id) AS nb_products,
c.customer_unique_id,
pt.product_category_name_english
FROM orders o
LEFT JOIN nb_items i ON o.order_id = i.order_id
LEFT JOIN customers c ON o.customer_id = c.customer_id
LEFT JOIN order_items oi ON o.order_id = oi.order_id
LEFT JOIN product_translations pt ON oi.product_id = pt.product_id
WHERE
o.order_purchase_timestamp < DATE('2018-09-01')
AND o.order_purchase_timestamp > DATE('2017-01-01')
AND o.order_status = 'delivered'
GROUP BY c.customer_unique_id, pt.product_category_name_english
ORDER BY total_items DESC;
""", conn)
# Show the top 20
display(purchase_df.head(20))
| | nb_orders | total_items | average_basket | nb_products | customer_unique_id | product_category_name_english |
|---|---|---|---|---|---|---|
| 0 | 1 | 441 | 21.000000 | 3 | 4546caea018ad8c692964e3382debd19 | health_beauty |
| 1 | 1 | 400 | 20.000000 | 1 | 698e1cf81d01a3d389d96145f7fa6df8 | auto |
| 2 | 1 | 400 | 20.000000 | 1 | c402f431464c72e27330a67f7b94d4fb | computers_accessories |
| 3 | 2 | 234 | 13.000000 | 2 | 0f5ac8d5c31de21d2f25e24be15bbffb | furniture_decor |
| 4 | 1 | 225 | 15.000000 | 1 | 11f97da02237a49c8e783dfda6f50e8e | garden_tools |
| 5 | 1 | 196 | 14.000000 | 1 | 31e412b9fb766b6794724ed17a41dfa6 | garden_tools |
| 6 | 1 | 196 | 14.000000 | 1 | f7ea4eef770a388bd5b225acfc546604 | telephony |
| 7 | 1 | 169 | 13.000000 | 1 | 7582a5a77fc2976628f46a13ec91b375 | telephony |
| 8 | 1 | 144 | 12.000000 | 4 | 37bc3d463e2a0024012a7fa587597a3c | baby |
| 9 | 1 | 144 | 12.000000 | 2 | 4d99682572b7b5932340a0bce676c18c | housewares |
| 10 | 1 | 144 | 12.000000 | 1 | 9eca552063680b7cd2ed0e8aea002b21 | watches_gifts |
| 11 | 4 | 144 | 6.000000 | 1 | c8460e4251689ba205045f3ea17884a1 | telephony |
| 12 | 1 | 144 | 12.000000 | 1 | ce9f8b9c31d83341764708396ac7e38b | housewares |
| 13 | 1 | 144 | 12.000000 | 2 | d3383e8df3cd44cd351aecff92e34627 | housewares |
| 14 | 2 | 137 | 9.133333 | 4 | eae0a83d752b1dd32697e0e7b4221656 | office_furniture |
| 15 | 2 | 125 | 9.615385 | 7 | d97b3cfb22b0d6b25ac9ed4e9c2d481b | bed_bath_table |
| 16 | 1 | 121 | 11.000000 | 1 | a5c6335399140e986db84120c425adf0 | housewares |
| 17 | 1 | 100 | 10.000000 | 1 | 09e9991c50513a303a2da26619870d7f | computers_accessories |
| 18 | 1 | 100 | 10.000000 | 1 | 20a5257c01689ac69410a14cb51bb447 | computers_accessories |
| 19 | 1 | 100 | 10.000000 | 2 | 38a4f1deb45ca914dd13c73b41775d71 | furniture_living_room |
Customers who ordered more than 100 units of the same product in the computers_accessories, auto, watches_gifts or telephony categories are resellers, i.e. B2B (business to business).
# NaN values in the category column
purchase_df.isna().sum()
nb_orders                           0
total_items                         0
average_basket                      0
nb_products                         0
customer_unique_id                  0
product_category_name_english    1404
dtype: int64
# Replace NaN 'product_category_name_english' with 'category_not_defined'
purchase_df['product_category_name_english'] = purchase_df['product_category_name_english'].fillna('category_not_defined')
# List of categories
product_category = list(purchase_df['product_category_name_english'].unique())
# B2B categories
B2B_categories = ['computers_accessories', 'auto', 'watches_gifts', 'telephony']
# Identify B2B customers
B2B_customer = purchase_df.loc[(purchase_df['total_items']>100) &
(purchase_df['product_category_name_english'].isin(B2B_categories)),
'customer_unique_id']
# Exclude customers who ordered more than 100 units of the same product in B2B categories
purchase_df = purchase_df[~purchase_df['customer_unique_id'].isin(B2B_customer)]
# Pivot the df to get one column per category
purchase_category_df = purchase_df.pivot_table(index='customer_unique_id',
columns='product_category_name_english',
values='total_items',
aggfunc='sum').reset_index().fillna(0)
# Add a total_items variable
purchase_category_df['total_items'] = purchase_category_df.iloc[:,1:].sum(axis=1)
display(Markdown(f"Dataframe 'purchase_category_df' with dimensions {purchase_category_df.shape}"))
Dataframe 'purchase_category_df' with dimensions (93098, 74)
# Join basket_df and purchase_category_df
purchases_df = pd.merge(basket_df, purchase_category_df, on='customer_unique_id', how='inner' )
display(purchases_df.head())
| | customer_unique_id | average_basket | agro_industry_and_commerce | air_conditioning | art | arts_and_craftmanship | audio | auto | baby | bed_bath_table | books_general_interest | books_imported | books_technical | category_not_defined | cds_dvds_musicals | christmas_supplies | cine_photo | computers | computers_accessories | consoles_games | construction_tools_construction | construction_tools_lights | construction_tools_safety | cool_stuff | costruction_tools_garden | costruction_tools_tools | diapers_and_hygiene | drinks | dvds_blu_ray | electronics | fashio_female_clothing | fashion_bags_accessories | fashion_childrens_clothes | fashion_male_clothing | fashion_shoes | fashion_sport | fashion_underwear_beach | fixed_telephony | flowers | food | food_drink | furniture_bedroom | furniture_decor | furniture_living_room | furniture_mattress_and_upholstery | garden_tools | health_beauty | home_appliances | home_appliances_2 | home_comfort_2 | home_confort | home_construction | housewares | industry_commerce_and_business | kitchen_dining_laundry_garden_furniture | la_cuisine | luggage_accessories | market_place | music | musical_instruments | office_furniture | party_supplies | perfumery | pet_shop | security_and_services | signaling_and_security | small_appliances | small_appliances_home_oven_and_coffee | sports_leisure | stationery | tablets_printing_image | telephony | toys | watches_gifts | total_items |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0000366f3b9a7992bf8c76cfdf3221e2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0000b849f77a49e4a4ce2b2a4ca5be3f | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0000f46a3911fa3c0805444483337064 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0000f6ccb0745a6a4b88665a16c9f078 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0004aac84e0df4da2b147fca70cf8255 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
# Lorenz curve
customer = purchases_df['total_items'].values
n = len(customer)
lorenz = np.cumsum(np.sort(customer)) / customer.sum()
lorenz = np.append([0],lorenz) # The Lorenz curve starts at 0
xaxis = np.linspace(0-1/n,1+1/n,n+1) # One segment of width 1/n per individual, plus one extra segment at ordinate 0. The first segment starts at 0-1/n, the last ends at 1+1/n.
plt.plot(xaxis,lorenz,drawstyle='steps-post')
plt.plot([0,1], [0,1]) # Plot the bisector
plt.axvline(x=0.88, color='r', linestyle='--')
plt.text(0.86, -0.06, '0.88', color='r', fontsize=12)
plt.axhline(y=0.54, color='r', linestyle='--')
plt.text(-0.08, 0.54, '0.54', color='r', fontsize=12)
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel("% of customers")
plt.ylabel("% of items sold")
plt.grid(color='gray', linestyle='-', linewidth=0.5)
plt.title("Number of items purchased per customer (total_items)",fontsize=11)
plt.show()
Table "order_pymts"¶
- Duplicates
order_id
# Duplicate order_id
pymts_per_order_df = pd.read_sql_query("""
SELECT order_id, COUNT(*) AS occurrence_count
FROM order_pymts
GROUP BY order_id
HAVING COUNT(*)>1;
""", conn)
nb_doublons = pymts_per_order_df.shape[0]
display(Markdown(f"Il y a {nb_doublons} doublon(s) parmi les`order_id`,"\
+f" soit {nb_doublons} commandes avec des paiements en plusieurs fois."))
# Affichage d'un exemple de doublon
sample = pymts_per_order_df['order_id'].head(1).values[0]
display(pd.read_sql_query("""
SELECT * FROM order_pymts
WHERE order_id = '{}';
""".format(sample), conn))
There are 2961 duplicated order_id, i.e. 2961 orders with multiple payments.
| | index | order_id | payment_sequential | payment_type | payment_installments | payment_value |
|---|---|---|---|---|---|---|
| 0 | 80856 | 0016dfedd97fc2950e388d2971d718c7 | 2 | voucher | 1 | 17.92 |
| 1 | 89575 | 0016dfedd97fc2950e388d2971d718c7 | 1 | credit_card | 5 | 52.63 |
Note: payment_installments = number of installments the payment is split into
- Number of payments per order
# Payments in several steps (payment_sequential)
df_temp = pd.read_sql_query("""
SELECT payment_sequential
FROM order_pymts;
""", conn)
# Distribution of the number of payments for a single order
print('')
min_payment_sequential = min(df_temp['payment_sequential'])
max_payment_sequential = max(df_temp['payment_sequential'])
display(Markdown(f"Order payments were made in {min_payment_sequential} to {max_payment_sequential} steps."))
plt.figure(figsize=(15, 5))
plt.title("Distribution of the number of payments for a single order", fontsize=18)
sns.boxplot(data=df_temp, x='payment_sequential', showfliers=True, showmeans=True, meanprops=meanprops)
plt.show()
Order payments were made in 1 to 29 steps.
- Payment types used
# Payment types
df_temp = pd.read_sql_query("""
SELECT
payment_type,
COUNT(order_id) AS occurrence_count,
MIN(payment_value) AS min_payment_value,
MAX(payment_value) AS max_payment_value
FROM order_pymts
GROUP BY payment_type
ORDER BY occurrence_count DESC;
""", conn)
display(df_temp)
| | payment_type | occurrence_count | min_payment_value | max_payment_value |
|---|---|---|---|---|
| 0 | credit_card | 76795 | 0.01 | 13664.08 |
| 1 | boleto | 19784 | 11.62 | 7274.88 |
| 2 | voucher | 5775 | 0.00 | 3184.34 |
| 3 | debit_card | 1529 | 13.38 | 4445.50 |
| 4 | not_defined | 3 | 0.00 | 0.00 |
df_temp = pd.read_sql_query("""
SELECT payment_type, COUNT(order_id) as occurrence_count
FROM order_pymts
GROUP BY payment_type
ORDER BY occurrence_count DESC;
""", conn)
colors = ['#482475', '#355f8d', '#21918c', '#44bf70', '#bddf26']
plt.pie(x=df_temp['occurrence_count'],
colors = colors,
autopct='%.1f%%',
textprops={'fontsize': 12},
pctdistance=1.15)
plt.legend(df_temp['payment_type'], bbox_to_anchor=(1,1))
plt.show()
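The pie-chart percentages can also be computed directly in SQL; a minimal sketch using a window function (an alternative to the pandas route, and it assumes the SQLite build supports window functions, i.e. version 3.25+):
# Sketch: payment-type shares computed in SQL with a window function
df_share = pd.read_sql_query("""
SELECT payment_type,
       COUNT(order_id) AS occurrence_count,
       ROUND(COUNT(order_id) * 100.0 / SUM(COUNT(order_id)) OVER (), 1) AS percent
FROM order_pymts
GROUP BY payment_type
ORDER BY occurrence_count DESC;
""", conn)
display(df_share)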
- Analysis of payment_type = 'not_defined'
# Analysis of payment_type = 'not_defined'
df_temp = pd.read_sql_query("""
SELECT p.order_id, p.payment_type, p.payment_value, o.order_status
FROM order_pymts p
INNER JOIN orders o ON p.order_id = o.order_id
WHERE payment_type = 'not_defined';
""", conn)
display(Markdown("`payment_type` = 'not_defined' => commande 'canceled'"))
display(df_temp)
payment_type = 'not_defined' => commande 'canceled'
| order_id | payment_type | payment_value | order_status | |
|---|---|---|---|---|
| 0 | 4637ca194b6387e2d538dc89b124b0ee | not_defined | 0.0 | canceled |
| 1 | 00b1cb0320190ca0daa2c88b35206009 | not_defined | 0.0 | canceled |
| 2 | c8c528189310eaa44a745b8d9d26908b | not_defined | 0.0 | canceled |
- Revenue per day, month and year
# Revenue per day
income_df = pd.read_sql_query("""
WITH total_payment AS (
SELECT p.order_id, SUM(p.payment_value) AS total_payment
FROM order_pymts p
GROUP BY p.order_id
)
SELECT DATE(o.order_purchase_timestamp) AS order_date, SUM(t.total_payment) AS income
FROM total_payment t
INNER JOIN orders o ON t.order_id = o.order_id
GROUP BY order_date
ORDER BY order_date ASC;
""", conn)
# Convert 'order_date' to datetime
income_df['order_date'] = pd.to_datetime(income_df['order_date'])
# Time series: revenue per day
plt.figure(figsize=(20, 7))
plt.title('Revenue per day', fontsize=18)
sns.lineplot(data=income_df, x='order_date', y='income')
plt.xticks(rotation=45)
plt.show()
# Revenue per month
income_df['order_year_month'] = income_df['order_date'].dt.strftime('%Y-%m')
income_month_df = income_df.groupby('order_year_month', as_index=False)['income'].sum()
# Time series: revenue per month
plt.figure(figsize=(20, 7))
plt.title('Revenue per month', fontsize=18)
sns.barplot(data=income_month_df, x='order_year_month', y='income')
plt.xticks(rotation=45)
plt.show()
# Revenue per year
income_df['order_year'] = income_df['order_date'].dt.strftime('%Y')
income_year_df = income_df.groupby('order_year', as_index=False)['income'].sum()
# Time series: revenue per year
plt.title('Revenue per year', fontsize=12)
sns.barplot(data=income_year_df, x='order_year', y='income')
plt.show()
- Total purchase amount per customer (monetary)
monetary_df = pd.read_sql_query("""
WITH total_payment AS (
SELECT p.order_id, SUM(p.payment_value) AS total_payment
FROM order_pymts p
GROUP BY p.order_id
)
SELECT c.customer_unique_id,
SUM(t.total_payment) AS monetary,
AVG(t.total_payment) AS average_basket_amount
FROM total_payment t
LEFT JOIN orders o ON t.order_id = o.order_id
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE
o.order_purchase_timestamp < DATE('2018-09-01')
AND o.order_purchase_timestamp > DATE('2017-01-01')
AND o.order_status = 'delivered'
GROUP BY c.customer_unique_id
ORDER BY monetary DESC
;
""", conn)
display(Markdown(f"Dataframe 'monetary_df' de dimensions {monetary_df.shape}, features : {list(monetary_df.columns)}"))
Dataframe 'monetary_df' de dimensions (93104, 3), features : ['customer_unique_id', 'monetary', 'average_basket_amount']
# Lorenz curve
customer = monetary_df['monetary'].values
n = len(customer)
lorenz = np.cumsum(np.sort(customer)) / customer.sum()
lorenz = np.append([0],lorenz) # The Lorenz curve starts at 0
xaxis = np.linspace(0-1/n,1+1/n,n+1) # One segment of width 1/n per individual, plus one extra segment at ordinate 0. The first segment starts at 0-1/n, the last ends at 1+1/n.
plt.plot(xaxis,lorenz,drawstyle='steps-post')
plt.plot([0,1], [0,1]) # Plot the bisector
plt.axvline(x=0.90, color='r', linestyle='--')
plt.text(0.85, -0.06, '0.90', color='r', fontsize=12)
plt.axhline(y=0.62, color='r', linestyle='--')
plt.text(-0.08, 0.62, '0.62', color='r', fontsize=12)
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel("% of customers")
plt.ylabel("% of revenue")
plt.grid(color='gray', linestyle='-', linewidth=0.5)
plt.title("Revenue inequality across customers",fontsize=11)
plt.show()
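To read the 0.90/0.62 annotation directly (the bottom 90% of customers generate about 62% of revenue, hence roughly 38% for the top 10%), a quick check on the monetary_df built above:
# Share of revenue generated by the top 10% of customers
sorted_monetary = np.sort(monetary_df['monetary'].values)
top10_share = sorted_monetary[int(0.9 * len(sorted_monetary)):].sum() / sorted_monetary.sum()
print(f"Top 10% of customers account for {top10_share:.1%} of revenue")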
Table "order_reviews"¶
- Duplicates
order_id
# Duplicate order_id
df_temp = pd.read_sql_query("""
SELECT order_id, COUNT(*) AS occurrence_count
FROM order_reviews
GROUP BY order_id
HAVING COUNT(*) >1;
""", conn)
nb_doublons = df_temp.shape[0]
display(Markdown(f"Il y a {nb_doublons} doublon(s) parmi les`order_id`,"\
+f" soit {nb_doublons} commandes avec des plusieurs commentaires."))
# Affichage d'un exemple de doublon
sample = df_temp['order_id'].head(1).values[0]
display(pd.read_sql_query("""
SELECT * FROM order_reviews
WHERE order_id = '{}';
""".format(sample), conn))
There are 547 duplicated order_id, i.e. 547 orders with multiple reviews.
| | index | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | 22423 | 2a74b0559eb58fc1ff842ecc999594cb | 0035246a40f520710769010f752e7507 | 5 | None | Estou acostumada a comprar produtos pelo barat... | 2017-08-25 00:00:00 | 2017-08-29 21:45:57 |
| 1 | 25612 | 89a02c45c340aeeb1354a24e7d4b2c1e | 0035246a40f520710769010f752e7507 | 5 | None | None | 2017-08-29 00:00:00 | 2017-08-30 01:59:12 |
The most recent review is kept: customers can leave a review at any stage (order placed, awaiting delivery, after delivery). Since we want post-delivery opinions, the criterion used is review_answer_timestamp, the column with the fewest missing values.
# Latest review per order + order status
reviews_df = pd.read_sql_query("""
WITH LastReview AS (
SELECT order_id, MAX(review_answer_timestamp) AS max_timestamp
FROM order_reviews
GROUP BY order_id
)
SELECT r.*, o.order_status
FROM order_reviews r
INNER JOIN LastReview l ON r.order_id = l.order_id AND r.review_answer_timestamp = l.max_timestamp
LEFT JOIN orders o ON r.order_id = o.order_id;
""", conn)
- Review analysis
# Distribution of review scores
plt.figure(figsize=(15, 5))
plt.title("Review score distribution by order status", fontsize=18)
ax=sns.boxplot(data=reviews_df.sort_values('order_status'),
x='review_score',
y='order_status',
hue='order_status',
showfliers=True,
showmeans=True,
meanprops=meanprops)
# Add the value_counts counts to the y-axis labels
value_counts = reviews_df['order_status'].value_counts().sort_index()
ax.set_yticks(ax.get_yticks())
ax.set_yticklabels([f'{label} ({count})' for label, count in value_counts.items()])
plt.show()
# Orders without a review
df_temp = pd.read_sql_query("""
WITH OrdersWithoutReviews AS (
SELECT o.order_id, o.order_status
FROM orders o
LEFT JOIN order_reviews r ON o.order_id = r.order_id
WHERE r.review_id IS NULL
),
TotalOrdersByStatus AS (
SELECT order_status, COUNT(order_id) AS total_orders
FROM orders
GROUP BY order_status
),
LastReview AS (
SELECT order_id, MAX(review_answer_timestamp) AS max_timestamp
FROM order_reviews
GROUP BY order_id
),
OrderWithReviews AS (
SELECT
COUNT(r.order_id) AS nb_order_with_reviews,
ROUND(AVG(r.review_score), 2) AS mean_review_score,
o.order_status
FROM order_reviews r
INNER JOIN LastReview l ON r.order_id = l.order_id AND r.review_answer_timestamp = l.max_timestamp
LEFT JOIN orders o ON r.order_id = o.order_id
GROUP BY o.order_status
)
SELECT
tob.order_status,
tob.total_orders,
o.nb_order_with_reviews,
ROUND((o.nb_order_with_reviews * 100.0 / tob.total_orders), 1) AS percent_order_with_reviews,
o.mean_review_score,
COUNT(owr.order_id) AS nb_order_without_reviews,
ROUND((COUNT(owr.order_id) * 100.0 / tob.total_orders), 1) AS percent_order_without_reviews
FROM TotalOrdersByStatus tob
LEFT JOIN OrdersWithoutReviews owr ON tob.order_status = owr.order_status
LEFT JOIN OrderWithReviews o ON tob.order_status = o.order_status
GROUP BY owr.order_status, tob.total_orders;
""", conn)
display(df_temp)
| | order_status | total_orders | nb_order_with_reviews | percent_order_with_reviews | mean_review_score | nb_order_without_reviews | percent_order_without_reviews |
|---|---|---|---|---|---|---|---|
| 0 | approved | 2 | 2 | 100.0 | 2.50 | 0 | 0.0 |
| 1 | canceled | 625 | 605 | 96.8 | 1.80 | 20 | 3.2 |
| 2 | created | 5 | 3 | 60.0 | 2.33 | 2 | 40.0 |
| 3 | delivered | 96478 | 95832 | 99.3 | 4.16 | 646 | 0.7 |
| 4 | invoiced | 314 | 309 | 98.4 | 1.63 | 5 | 1.6 |
| 5 | processing | 301 | 295 | 98.0 | 1.26 | 6 | 2.0 |
| 6 | shipped | 1107 | 1032 | 93.2 | 2.00 | 75 | 6.8 |
| 7 | unavailable | 609 | 595 | 97.7 | 1.53 | 14 | 2.3 |
- Discretization of review_score
# Latest review per order, for order status = 'delivered'
review_score_df = pd.read_sql_query("""
WITH LastReview AS (
SELECT order_id, MAX(review_answer_timestamp) AS max_timestamp
FROM order_reviews
GROUP BY order_id
)
SELECT c.customer_unique_id, AVG(r.review_score) AS avg_review_score
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
LEFT JOIN order_reviews r ON o.order_id = r.order_id
LEFT JOIN LastReview l ON r.order_id = l.order_id AND r.review_answer_timestamp = l.max_timestamp
WHERE
o.order_purchase_timestamp < DATE('2018-09-01')
AND o.order_purchase_timestamp > DATE('2017-01-01')
AND o.order_status = 'delivered'
GROUP BY c.customer_unique_id;
""", conn)
Review classes:
- 0: no review_score -> neutral
- -1: score in [0, 4) -> dissatisfied customer
- 1: score in [4, 5] -> satisfied customer
# Define the classes
review_score_df['avg_review_score'] = review_score_df['avg_review_score'].fillna(0)
review_score_df.loc[(review_score_df['avg_review_score']<4)&
(review_score_df['avg_review_score']>0), 'avg_review_score'] = -1
review_score_df.loc[review_score_df['avg_review_score']>=4, 'avg_review_score'] = 1
review_score_df.rename(columns={'avg_review_score':'review_score_class'}, inplace=True)
display(Markdown(f"Dataframe 'review_score_df' de dimensions {review_score_df.shape}, features : {list(review_score_df.columns)}"))
Dataframe 'review_score_df' de dimensions (93104, 2), features : ['customer_unique_id', 'review_score_class']
# Pourcentage des clients par classe
review_score_df['review_score_class'].value_counts(normalize=True)*100
review_score_class
 1.0    78.221129
-1.0    21.134430
 0.0     0.644441
Name: proportion, dtype: float64
Table "geoloc"¶
- Duplicates
geolocation_zip_code_prefix
# Duplicate geolocation_zip_code_prefix
df_temp = pd.read_sql_query("""
SELECT geolocation_zip_code_prefix, COUNT(*) AS occurrence_count
FROM geoloc
GROUP BY geolocation_zip_code_prefix
HAVING COUNT(*) >1;
""", conn)
nb_doublons = df_temp.shape[0]
display(Markdown(f"Il y a {nb_doublons} doublon(s) parmi les`geolocation_zip_code_prefix`"))
# Affichage d'un exemple de doublon
sample = df_temp['geolocation_zip_code_prefix'].head(1).values[0]
display(pd.read_sql_query("""
SELECT * FROM geoloc
WHERE geolocation_zip_code_prefix = '{}';
""".format(sample), conn))
There are 17972 duplicate(s) among geolocation_zip_code_prefix
| | index | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city | geolocation_state |
|---|---|---|---|---|---|---|
| 0 | 99 | 1001 | -23.549292 | -46.633559 | sao paulo | SP |
| 1 | 206 | 1001 | -23.550498 | -46.634338 | sao paulo | SP |
| 2 | 235 | 1001 | -23.550642 | -46.634410 | sao paulo | SP |
| 3 | 299 | 1001 | -23.549698 | -46.633909 | sao paulo | SP |
| 4 | 326 | 1001 | -23.551427 | -46.634074 | sao paulo | SP |
| 5 | 429 | 1001 | -23.550498 | -46.634338 | sao paulo | SP |
| 6 | 519 | 1001 | -23.551337 | -46.634027 | sao paulo | SP |
| 7 | 575 | 1001 | -23.549779 | -46.633957 | são paulo | SP |
| 8 | 583 | 1001 | -23.551337 | -46.634027 | sao paulo | SP |
| 9 | 596 | 1001 | -23.550498 | -46.634338 | sao paulo | SP |
| 10 | 608 | 1001 | -23.550263 | -46.634196 | são paulo | SP |
| 11 | 639 | 1001 | -23.550498 | -46.634338 | sao paulo | SP |
| 12 | 771 | 1001 | -23.550498 | -46.634338 | sao paulo | SP |
| 13 | 818 | 1001 | -23.551337 | -46.634027 | sao paulo | SP |
| 14 | 851 | 1001 | -23.549825 | -46.633970 | sao paulo | SP |
| 15 | 864 | 1001 | -23.549825 | -46.633970 | sao paulo | SP |
| 16 | 897 | 1001 | -23.549292 | -46.633559 | sao paulo | SP |
| 17 | 912 | 1001 | -23.550498 | -46.634338 | sao paulo | SP |
| 18 | 985 | 1001 | -23.550498 | -46.634338 | sao paulo | SP |
| 19 | 1004 | 1001 | -23.549292 | -46.633559 | sao paulo | SP |
| 20 | 1062 | 1001 | -23.550498 | -46.634338 | sao paulo | SP |
| 21 | 1182 | 1001 | -23.549779 | -46.633957 | sao paulo | SP |
| 22 | 1246 | 1001 | -23.549292 | -46.633559 | sao paulo | SP |
| 23 | 1351 | 1001 | -23.549951 | -46.634027 | são paulo | SP |
| 24 | 1384 | 1001 | -23.549292 | -46.633559 | sao paulo | SP |
| 25 | 1435 | 1001 | -23.549292 | -46.633559 | sao paulo | SP |
# Boxplot 'geolocation_lat', 'geolocation_lng'
df_temp = pd.read_sql_query("""
SELECT
geolocation_zip_code_prefix,
AVG(geolocation_lat) AS geolocation_lat,
AVG(geolocation_lng) AS geolocation_lng
FROM geoloc
GROUP BY geolocation_zip_code_prefix;
""", conn)
fig, ax = plt.subplots(1,2,figsize=(20,5))
for i, loc in enumerate(['geolocation_lat', 'geolocation_lng']):
ax[i].set_title(f"Distribution of {loc}")
sns.boxplot(data=df_temp, x=loc, ax=ax[i], showfliers=True, showmeans=True, meanprops=meanprops)
plt.show()
Brazil's coordinates are:
- Latitudes from -35 to 6
- Longitudes from -75 to -34
# # Geographic locations of the zip codes
# map_center = [df_temp['geolocation_lat'].mean(), df_temp['geolocation_lng'].mean()]
# mymap = folium.Map(location=map_center, zoom_start=2)
# for index, row in df_temp.iterrows():
# latitude = row['geolocation_lat']
# longitude = row['geolocation_lng']
# folium.CircleMarker(location=[latitude, longitude],
# radius=10,
# color='blue',
# fill=True,
# fill_color='blue').add_to(mymap)
# # Display the map
# mymap.save('map.html')
# mymap
There are indeed locations outside Brazil.
geoloc_cleaned = pd.read_sql_query("""
WITH CleanedCities AS (
SELECT
geolocation_zip_code_prefix,
geolocation_lat,
geolocation_lng,
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
geolocation_city,
"'",''
),
'õ','o'
),
'ê','e'
),
'ç','c'
),
'ô','o'
),
'ú', 'u'
),
'á', 'a'
),
'ã', 'a'
),
'â', 'a'
),
'é', 'e'
),
'í', 'i'
),
'ó', 'o'
) AS geolocation_city_cleaned,
geolocation_state
FROM geoloc
)
SELECT
geolocation_zip_code_prefix,
AVG(geolocation_lat) AS geolocation_lat,
AVG(geolocation_lng) AS geolocation_lng,
geolocation_city_cleaned,
geolocation_state
FROM
CleanedCities
WHERE
geolocation_lng <-32
AND geolocation_lat < 6
GROUP BY
geolocation_zip_code_prefix, geolocation_city_cleaned, geolocation_state;
""", conn)
geoloc_cleaned.shape
(19592, 5)
# Duplicates of 'geolocation_zip_code_prefix'
geoloc_cleaned[geoloc_cleaned['geolocation_zip_code_prefix'].duplicated(keep=False)]
| | geolocation_zip_code_prefix | geolocation_lat | geolocation_lng | geolocation_city_cleaned | geolocation_state |
|---|---|---|---|---|---|
| 159 | 1307 | -23.556812 | -46.657135 | sao bernardo do campo | SP |
| 160 | 1307 | -23.552765 | -46.653195 | sao paulo | SP |
| 368 | 2116 | -23.515978 | -46.582170 | sao paulo | RN |
| 369 | 2116 | -23.520455 | -46.585496 | sao paulo | SP |
| 1096 | 3203 | -23.216648 | -46.861371 | jundiai | SP |
| ... | ... | ... | ... | ... | ... |
| 19128 | 96222 | -32.079705 | -52.216097 | rio grande | RS |
| 19210 | 96859 | -29.559440 | -52.334720 | monte alverne | RS |
| 19211 | 96859 | -29.707241 | -52.444254 | santa cruz do sul | RS |
| 19279 | 97538 | -30.208563 | -57.550783 | barra do quarai | RS |
| 19280 | 97538 | -30.208502 | -57.551585 | barrado quarai | RS |
1138 rows × 5 columns
City-name spelling will still need to be homogenized, along with the zip_code / geolocation_city / geolocation_state correspondences; the sketch below shows a lighter way to normalize accents.
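The nested REPLACE chain above only covers the accents it explicitly lists. As an alternative sketch (an assumption, not the query actually used here), Python's standard unicodedata module can strip any diacritic before matching city names:
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented characters (é -> e + combining mark), then drop the marks
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if not unicodedata.combining(c)).replace("'", "")

# strip_accents('são paulo') -> 'sao paulo'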
# Duplicates are dropped for this first study
geoloc_cleaned.drop_duplicates(subset='geolocation_zip_code_prefix', inplace=True, ignore_index=True)
geoloc_cleaned.shape
(19011, 5)
- Customer locations
# Geographic location of customers
customer_loc = pd.merge(customer_df,
geoloc_cleaned,
how='left',
left_on='customer_zip_code_prefix',
right_on='geolocation_zip_code_prefix',
indicator=True)
# Number of zip codes with no match in the 'geoloc' table
nb_empty_zipcode = len(customer_loc.loc[customer_loc['_merge']=='left_only','customer_zip_code_prefix'].unique())
# Number of cities with no match in the 'geoloc' table
geoloc_city = list(geoloc_cleaned['geolocation_city_cleaned'].unique())
empty_city = list(customer_loc.loc[customer_loc['_merge']=='left_only','customer_city'].unique())
differences = len([x for x in empty_city if x not in geoloc_city])
display(Markdown(f"There are {nb_empty_zipcode} `customer_zip_code_prefix` with no match in the 'geoloc' table "))
display(Markdown(f"corresponding to {len(empty_city)} `customer_city`, of which {differences} have no match in the 'geoloc' table"))
There are 152 customer_zip_code_prefix with no match in the 'geoloc' table
corresponding to 83 customer_city, of which 42 have no match in the 'geoloc' table
The 'geoloc' table should be updated with the coordinates of the missing zip codes (possibly new customers).
# Loop over the rows of 'customer_loc' with missing latitude and longitude values
customer_loc_cleaned = customer_loc.copy()
mask = (customer_loc_cleaned['geolocation_lat'].isna())&(customer_loc_cleaned['geolocation_lng'].isna())
for index, row in customer_loc[mask].iterrows():
city_name = row['customer_city']
# If the city is in geoloc_city
if city_name in geoloc_city:
# Fill the missing values with the mean coordinates of the city
latitude = geoloc_cleaned[geoloc_cleaned['geolocation_city_cleaned']==city_name]['geolocation_lat'].mean()
longitude = geoloc_cleaned[geoloc_cleaned['geolocation_city_cleaned']==city_name]['geolocation_lng'].mean()
customer_loc_cleaned.at[index, 'geolocation_lat'] = latitude
customer_loc_cleaned.at[index, 'geolocation_lng'] = longitude
# Number of NaN
customer_loc_cleaned.isna().sum()
customer_unique_id               0
recency                          0
frequency                        0
mean_delivery_delay              0
customer_zip_code_prefix         0
customer_city                    0
customer_state                   0
geolocation_zip_code_prefix    255
geolocation_lat                 48
geolocation_lng                 48
geolocation_city_cleaned       255
geolocation_state              255
_merge                           0
dtype: int64
# Select columns and drop NaN rows
mask = customer_loc_cleaned['geolocation_lat'].isna()
customer_loc_cleaned = customer_loc_cleaned.loc[~mask, ['customer_unique_id', 'geolocation_lat', 'geolocation_lng']]
customer_loc_cleaned.shape
(93048, 3)
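For reference, a vectorized sketch equivalent to the loop above (same logic, no Python-level iteration): average the coordinates per cleaned city name, then fill the missing values via map on customer_city.
# Sketch: vectorized fill of missing coordinates using per-city mean coordinates
city_coords = geoloc_cleaned.groupby('geolocation_city_cleaned')[['geolocation_lat', 'geolocation_lng']].mean()
filled = customer_loc.copy()
filled['geolocation_lat'] = filled['geolocation_lat'].fillna(filled['customer_city'].map(city_coords['geolocation_lat']))
filled['geolocation_lng'] = filled['geolocation_lng'].fillna(filled['customer_city'].map(city_coords['geolocation_lng']))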
# # Geographic locations of the customers
# map_center = [customer_loc_cleaned['geolocation_lat'].mean(), customer_loc_cleaned['geolocation_lng'].mean()]
# customer_map = folium.Map(location=map_center, zoom_start=2)
# for index, row in customer_loc_cleaned.iterrows():
# latitude = row['geolocation_lat']
# longitude = row['geolocation_lng']
# folium.CircleMarker(location=[latitude, longitude],
# radius=10,
# color='blue',
# fill=True,
# fill_color='blue').add_to(customer_map)
# # Display the map
# customer_map.save('customer_map.html')
# customer_map
Tables "products" et "translation"¶
- Number of products per category
# Number of products per category
df_temp = pd.read_sql_query("""
SELECT t.product_category_name_english, COUNT(*) AS nb_products
FROM products p
INNER JOIN translation t ON p.product_category_name = t.product_category_name
GROUP BY product_category_name_english
ORDER BY COUNT(*) DESC
""", conn)
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,5), tight_layout=True)
# Distribution of the number of products per category
ax1.set_title("Distribution of the number of products per category")
sns.histplot(data=df_temp, x='nb_products', ax=ax1)
# Top 10 product categories
ax2.set_title("Top 10 categories with the most products")
sns.barplot(data=df_temp.head(10), y='product_category_name_english', x='nb_products', orient='h', ax=ax2)
# Annotate bars with nb_products values
for index, value in enumerate(df_temp['nb_products'].head(10)):
ax2.text(value, index, str(value), ha='left', va='center')
plt.show()
# Number of units sold per product
df_temp = pd.read_sql_query("""
WITH product_sold AS (
SELECT
product_id,
COUNT(product_id) AS nb_product_sold
FROM order_items
GROUP BY product_id
)
SELECT
t.product_category_name_english AS product_category,
oi.product_id,
oi.price,
SUM(ps.nb_product_sold) AS nb_product_sold
FROM order_items oi
LEFT JOIN products p ON oi.product_id = p.product_id
LEFT JOIN translation t ON p.product_category_name = t.product_category_name
LEFT JOIN orders o ON oi.order_id = o.order_id
LEFT JOIN product_sold ps ON oi.product_id = ps.product_id
WHERE
o.order_purchase_timestamp < DATE('2018-09-01')
AND o.order_purchase_timestamp > DATE('2017-01-01')
AND o.order_status = 'delivered'
GROUP BY oi.product_id;
""", conn)
df_temp['total_sales_amount'] = df_temp['price']*df_temp['nb_product_sold']
fig, (ax1,ax2) = plt.subplots(2,1, figsize=(10,10), tight_layout=True)
# Top 10 des produits les plus vendus
ax1.set_title("Top 10 des produits avec les plus vendus")
sns.barplot(data=df_temp.sort_values('nb_product_sold', ascending=False).head(10),
y='product_id',
x='nb_product_sold',
hue='product_category',
orient='h',
ax=ax1)
# Top 10 products by sales amount
ax2.set_title("Top 10 products by sales amount")
sns.barplot(data=df_temp.sort_values('total_sales_amount', ascending=False).head(10),
y='product_id',
x='total_sales_amount',
hue='product_category',
orient='h',
ax=ax2)
plt.show()
# Aggregate the data by product category
products_df = df_temp.groupby('product_category',as_index=False).agg(nb_product = ('product_id', 'count'),
mean_price =('price', 'mean'),
min_price = ('price', 'min'),
max_price = ('price', 'max'),
nb_product_sold = ('nb_product_sold','sum'),
total_sales_amount = ('total_sales_amount','sum')
)
products_df['%_sales'] = products_df['nb_product_sold'] / sum(products_df['nb_product_sold'])*100
products_df['%_gross_sales'] = products_df['total_sales_amount'] / sum(products_df['total_sales_amount'])*100
products_df.sort_values('%_gross_sales', ascending=False).head(10)
| | product_category | nb_product | mean_price | min_price | max_price | nb_product_sold | total_sales_amount | %_sales | %_gross_sales |
|---|---|---|---|---|---|---|---|---|---|
| 43 | health_beauty | 2380 | 147.472416 | 1.20 | 3124.00 | 473047 | 50624243.72 | 11.856346 | 13.765481 |
| 70 | watches_gifts | 1297 | 331.668658 | 8.99 | 3999.90 | 368911 | 49420844.70 | 9.246305 | 13.438259 |
| 42 | garden_tools | 723 | 220.902365 | 6.35 | 3930.00 | 795358 | 47220716.88 | 19.934679 | 12.840011 |
| 15 | computers_accessories | 1598 | 154.288567 | 3.90 | 3699.99 | 398644 | 43907776.24 | 9.991526 | 11.939174 |
| 7 | bed_bath_table | 2987 | 107.574620 | 6.99 | 1999.98 | 454713 | 41763032.47 | 11.396827 | 11.355986 |
| 39 | furniture_decor | 2559 | 102.929707 | 4.90 | 1899.00 | 407699 | 28094643.12 | 10.218478 | 7.639349 |
| 20 | cool_stuff | 767 | 213.048396 | 7.00 | 3109.99 | 124767 | 17422750.51 | 3.127133 | 4.737504 |
| 65 | sports_leisure | 2812 | 134.436046 | 4.50 | 4059.00 | 116628 | 9538371.51 | 2.923139 | 2.593624 |
| 49 | housewares | 2281 | 97.812968 | 3.06 | 6735.00 | 107828 | 9081050.39 | 2.702577 | 2.469272 |
| 59 | perfumery | 850 | 121.568859 | 4.99 | 689.90 | 95313 | 8217393.02 | 2.388904 | 2.234431 |
products_df.sort_values('%_gross_sales', ascending=False)
| | product_category | nb_product | mean_price | min_price | max_price | nb_product_sold | total_sales_amount | %_sales | %_gross_sales |
|---|---|---|---|---|---|---|---|---|---|
| 43 | health_beauty | 2380 | 147.472416 | 1.20 | 3124.00 | 473047 | 50624243.72 | 11.856346 | 13.765481 |
| 70 | watches_gifts | 1297 | 331.668658 | 8.99 | 3999.90 | 368911 | 49420844.70 | 9.246305 | 13.438259 |
| 42 | garden_tools | 723 | 220.902365 | 6.35 | 3930.00 | 795358 | 47220716.88 | 19.934679 | 12.840011 |
| 15 | computers_accessories | 1598 | 154.288567 | 3.90 | 3699.99 | 398644 | 43907776.24 | 9.991526 | 11.939174 |
| 7 | bed_bath_table | 2987 | 107.574620 | 6.99 | 1999.98 | 454713 | 41763032.47 | 11.396827 | 11.355986 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3 | arts_and_craftmanship | 19 | 65.744737 | 9.80 | 289.49 | 46 | 4434.61 | 0.001153 | 0.001206 |
| 32 | fashion_sport | 19 | 78.127895 | 24.99 | 225.00 | 65 | 3918.28 | 0.001629 | 0.001065 |
| 52 | la_cuisine | 10 | 155.699000 | 24.00 | 389.00 | 24 | 3423.99 | 0.000602 | 0.000931 |
| 29 | fashion_childrens_clothes | 4 | 87.495000 | 39.99 | 110.00 | 15 | 939.87 | 0.000376 | 0.000256 |
| 61 | security_and_services | 2 | 141.645000 | 100.00 | 183.29 | 2 | 283.29 | 0.000050 | 0.000077 |
71 rows × 9 columns
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), tight_layout=True)
# Top 10 best-selling categories
ax1.set_title("Top 10 best-selling categories")
sns.barplot(data=products_df.sort_values('%_sales', ascending=False).head(10),
y='product_category',
x='%_sales',
orient='h',
ax=ax1)
# Top 10 categories by revenue
ax2.set_title("Top 10 categories by revenue")
sns.barplot(data=products_df.sort_values('%_gross_sales', ascending=False).head(10),
y='product_category',
x='%_gross_sales',
orient='h',
ax=ax2)
plt.show()
# Top 10
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(10, 10), tight_layout=True)
# Top 10 mean prices per category
ax1.set_title("Top 10 mean prices per category")
sns.barplot(data=products_df.sort_values('mean_price', ascending=False).head(10),
y='product_category',
x='mean_price',
orient='h',
ax=ax1)
# Top 10 best-selling categories
ax2.set_title("Top 10 best-selling categories")
sns.barplot(data=products_df.sort_values('nb_product_sold', ascending=False).head(10),
y='product_category',
x='nb_product_sold',
orient='h',
ax=ax2)
# Top 10 categories by revenue
ax3.set_title("Top 10 categories by revenue")
sns.barplot(data=products_df.sort_values('total_sales_amount', ascending=False).head(10),
y='product_category',
x='total_sales_amount',
orient='h',
ax=ax3)
plt.show()
Table "sellers"¶
- Primary key and duplicates
# Duplicate seller_id
nb_doublons = pd.read_sql_query("""
SELECT COUNT(*) AS nombre_doublons
FROM (
SELECT seller_id, COUNT(*) AS occurrence_count
FROM sellers
GROUP BY seller_id
HAVING COUNT(*) > 1
) AS doublons;
""", conn)
display(Markdown(f"Il y a {nb_doublons.iloc[0,0]} doublon(s) parmi les`seller_id`."))
Il y a 0 doublon(s) parmi lesseller_id.
# Nombre de clients `seller_id`
df_temp = pd.read_sql_query("""
SELECT seller_id, COUNT(seller_id) AS occurrence_count
FROM sellers
GROUP BY seller_id;
""", conn)
display(Markdown(f"Olits a {df_temp.shape[0]} `seller_id` référencés dans sa base de données"))
Olits a 3095 seller_id référencés dans sa base de données
- Nombre de vendeurs par état
# Nombre de clients par état
df_temp = pd.read_sql_query("""
SELECT seller_state, COUNT(seller_id) AS nb_sellers
FROM sellers
GROUP BY seller_state
ORDER BY nb_sellers DESC;
""", conn)
# Barplot
plt.figure(figsize=(15,7))
sns.barplot(data = df_temp, x='seller_state', y='nb_sellers', palette='viridis', hue='nb_sellers')
plt.title('Number of sellers per state')
plt.show()
# Number of cities per state
df_temp = pd.read_sql_query("""
SELECT seller_state, COUNT(DISTINCT seller_city) AS nb_cities
FROM sellers
GROUP BY seller_state
ORDER BY nb_cities DESC;
""", conn)
# Barplot
plt.figure(figsize=(15,7))
sns.barplot(data = df_temp, x='seller_state', y='nb_cities', palette='viridis', hue='nb_cities')
plt.title('Number of cities with sellers, per state')
plt.show()
- Seller geolocation
# seller_df
seller_df = pd.read_sql_query("""
SELECT *
FROM sellers;
""", conn)
# Add coordinates
seller_loc = pd.merge(seller_df,
geoloc_cleaned,
how='left',
left_on='seller_zip_code_prefix',
right_on='geolocation_zip_code_prefix',
indicator=True)
# Number of zip codes with no match in the 'geoloc' table
nb_empty_zipcode = len(seller_loc.loc[seller_loc['_merge']=='left_only','seller_zip_code_prefix'].unique())
# Number of cities with no match in the 'geoloc' table
geoloc_city = list(geoloc_cleaned['geolocation_city_cleaned'].unique())
empty_city = list(seller_loc.loc[seller_loc['_merge']=='left_only','seller_city'].unique())
differences = len([x for x in empty_city if x not in geoloc_city])
display(Markdown(f"There are {nb_empty_zipcode} `seller_zip_code_prefix` with no match in the 'geoloc' table "))
display(Markdown(f"corresponding to {len(empty_city)} `seller_city`, of which {differences} have no match in the 'geoloc' table"))
There are 7 seller_zip_code_prefix with no match in the 'geoloc' table
corresponding to 6 seller_city, of which 0 have no match in the 'geoloc' table
# Loop over the rows of 'seller_loc' with missing latitude and longitude values
mask = (seller_loc['geolocation_lat'].isna())&(seller_loc['geolocation_lng'].isna())
for index, row in seller_loc[mask].iterrows():
city_name = row['seller_city']
# If the city is in geoloc_city
if city_name in geoloc_city:
# Fill the missing values with the mean coordinates of the city
latitude = geoloc_cleaned[geoloc_cleaned['geolocation_city_cleaned']==city_name]['geolocation_lat'].mean()
longitude = geoloc_cleaned[geoloc_cleaned['geolocation_city_cleaned']==city_name]['geolocation_lng'].mean()
seller_loc.at[index, 'geolocation_lat'] = latitude
seller_loc.at[index, 'geolocation_lng'] = longitude
# Geographic locations of the sellers
map_center = [seller_loc['geolocation_lat'].mean(), seller_loc['geolocation_lng'].mean()]
seller_map = folium.Map(location=map_center, zoom_start=2)
for index, row in seller_loc.iterrows():
latitude = row['geolocation_lat']
longitude = row['geolocation_lng']
folium.CircleMarker(location=[latitude, longitude],
radius=10,
color='blue',
fill=True,
fill_color='blue').add_to(seller_map)
# Display the map
seller_map.save('seller_map.html')
seller_map
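With around 3,000 sellers, one CircleMarker per row makes a heavy HTML file. A lighter alternative sketch using folium's MarkerCluster plugin (same data, clustered rendering):
# Sketch: clustered seller map with folium.plugins.MarkerCluster
from folium.plugins import MarkerCluster
clustered_map = folium.Map(location=map_center, zoom_start=4)
cluster = MarkerCluster().add_to(clustered_map)
for _, row in seller_loc.dropna(subset=['geolocation_lat', 'geolocation_lng']).iterrows():
    folium.Marker(location=[row['geolocation_lat'], row['geolocation_lng']]).add_to(cluster)
clustered_map.save('seller_map_clustered.html')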
- Revenue per seller
# Revenue per seller
df_temp = pd.read_sql_query("""
SELECT
s.seller_id,
SUM(oi.price) AS gross_sales
FROM sellers s
LEFT JOIN order_items oi ON s.seller_id = oi.seller_id
LEFT JOIN orders o ON oi.order_id = o.order_id
WHERE o.order_status = 'delivered'
GROUP BY s.seller_id
ORDER BY gross_sales DESC;
""", conn)
# Histplot
sns.histplot(data = df_temp, x='gross_sales')
plt.title('Distribution of revenue per seller')
plt.show()
# Lorenz curve
seller = df_temp['gross_sales'].values
n = len(seller)
lorenz = np.cumsum(np.sort(seller)) / seller.sum()
lorenz = np.append([0],lorenz) # The Lorenz curve starts at 0
xaxis = np.linspace(0-1/n,1+1/n,n+1) # One segment of width 1/n per individual, plus one extra segment at ordinate 0. The first segment starts at 0-1/n, the last ends at 1+1/n.
plt.plot(xaxis,lorenz,drawstyle='steps-post')
plt.plot([0,1], [0,1]) # Plot the bisector
plt.axvline(x=0.90, color='r', linestyle='--')
plt.text(0.90, -0.06, '0.90', color='r', fontsize=12)
plt.axhline(y=0.33, color='r', linestyle='--')
plt.text(-0.08, 0.33, '0.33', color='r', fontsize=12)
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel("% of sellers")
plt.ylabel("% of revenue")
plt.grid(color='gray', linestyle='-', linewidth=0.5)
plt.title("Revenue inequality across sellers",fontsize=11)
plt.show()
Conclusion of the exploratory analysis¶
- The unique customer identifier is `customer_unique_id`. A given customer can have different zip_code, city or state values across orders: the location of the most recent order is kept.
- Orders are filtered on `order_status` = 'delivered'.
- Orders are filtered on January 2017 <= `order_purchase_timestamp` < September 2018.
- The `payment_type` "not_defined" rows are removed by the `order_status` = 'delivered' filter.
- B2B customers are removed (more than 100 units of the same product in the computers_accessories, auto, watches_gifts or telephony categories).
- Coordinates: 'geolocation_lng' < -32 and 'geolocation_lat' < 6.
Part 3 - Final dataset for customer segmentation¶
Joining the per-customer aggregated tables¶
# Join customer_df & monetary_df
display(Markdown("**Join customer_df & monetary_df**"))
df_temp1 = pd.merge(customer_df, monetary_df, on='customer_unique_id', how='inner')
display(Markdown(f"* customer_df: {customer_df.shape}, monetary_df: {monetary_df.shape}, df_temp1: {df_temp1.shape}"))
# Join (customer_df, monetary_df) & review_score_df
display(Markdown("**Join (customer_df, monetary_df) & review_score_df**"))
df_temp2 = pd.merge(df_temp1, review_score_df, how='inner', on='customer_unique_id')
display(Markdown(f"* df_temp1: {df_temp1.shape}, review_score_df: {review_score_df.shape}, df_temp2: {df_temp2.shape}"))
# Join (customer_df, monetary_df, review_score_df) & customer_loc_cleaned
display(Markdown("**Join (customer_df, monetary_df, review_score_df) & customer_loc_cleaned**"))
df_temp3 = pd.merge(df_temp2, customer_loc_cleaned, how='inner', on='customer_unique_id')
display(Markdown(f"* df_temp2: {df_temp2.shape}, customer_loc_cleaned: {customer_loc_cleaned.shape}, df_temp3: {df_temp3.shape}"))
# Join (customer_df, monetary_df, review_score_df, customer_loc_cleaned) & purchases_df
display(Markdown("**Join (customer_df, monetary_df, review_score_df, customer_loc_cleaned) & purchases_df**"))
df_cleaned = pd.merge(df_temp3, purchases_df, how='inner', on='customer_unique_id')
display(Markdown(f"* df_temp3: {df_temp3.shape}, purchases_df: {purchases_df.shape}, df_cleaned: {df_cleaned.shape}"))
Join customer_df & monetary_df
- customer_df: (93096, 7), monetary_df: (93104, 3), df_temp1: (93096, 9)
Join (customer_df, monetary_df) & review_score_df
- df_temp1: (93096, 9), review_score_df: (93104, 2), df_temp2: (93096, 10)
Join (customer_df, monetary_df, review_score_df) & customer_loc_cleaned
- df_temp2: (93096, 10), customer_loc_cleaned: (93048, 3), df_temp3: (93048, 12)
Join (customer_df, monetary_df, review_score_df, customer_loc_cleaned) & purchases_df
- df_temp3: (93048, 12), purchases_df: (93098, 75), df_cleaned: (93042, 86)
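The shapes above show a few dozen customers dropped at each inner join. pandas can make those losses explicit with the indicator and validate options of merge; a sketch on the first join:
# Audit which customer_unique_id values would be lost by the inner join
audit = pd.merge(customer_df, monetary_df, on='customer_unique_id',
                 how='outer', indicator=True)
print(audit['_merge'].value_counts())  # 'left_only' / 'right_only' rows are the losses

# validate raises MergeError if customer_unique_id is duplicated on either side
_ = pd.merge(customer_df, monetary_df, on='customer_unique_id',
             how='inner', validate='one_to_one')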
# Display the selected features
Markdown("The variables selected for the segmentation are: \n{}".format(list(df_cleaned.columns)))
The variables selected for the segmentation are:
['customer_unique_id', 'recency', 'frequency', 'mean_delivery_delay', 'customer_zip_code_prefix', 'customer_city', 'customer_state', 'monetary', 'average_basket_amount', 'review_score_class', 'geolocation_lat', 'geolocation_lng', 'average_basket', 'agro_industry_and_commerce', 'air_conditioning', 'art', 'arts_and_craftmanship', 'audio', 'auto', 'baby', 'bed_bath_table', 'books_general_interest', 'books_imported', 'books_technical', 'category_not_defined', 'cds_dvds_musicals', 'christmas_supplies', 'cine_photo', 'computers', 'computers_accessories', 'consoles_games', 'construction_tools_construction', 'construction_tools_lights', 'construction_tools_safety', 'cool_stuff', 'costruction_tools_garden', 'costruction_tools_tools', 'diapers_and_hygiene', 'drinks', 'dvds_blu_ray', 'electronics', 'fashio_female_clothing', 'fashion_bags_accessories', 'fashion_childrens_clothes', 'fashion_male_clothing', 'fashion_shoes', 'fashion_sport', 'fashion_underwear_beach', 'fixed_telephony', 'flowers', 'food', 'food_drink', 'furniture_bedroom', 'furniture_decor', 'furniture_living_room', 'furniture_mattress_and_upholstery', 'garden_tools', 'health_beauty', 'home_appliances', 'home_appliances_2', 'home_comfort_2', 'home_confort', 'home_construction', 'housewares', 'industry_commerce_and_business', 'kitchen_dining_laundry_garden_furniture', 'la_cuisine', 'luggage_accessories', 'market_place', 'music', 'musical_instruments', 'office_furniture', 'party_supplies', 'perfumery', 'pet_shop', 'security_and_services', 'signaling_and_security', 'small_appliances', 'small_appliances_home_oven_and_coffee', 'sports_leisure', 'stationery', 'tablets_printing_image', 'telephony', 'toys', 'watches_gifts', 'total_items']
Analyse descriptive du jeu de données final¶
# Cast 'customer_zip_code_prefix' to string (it is an identifier, not a quantity)
df_cleaned['customer_zip_code_prefix'] = df_cleaned['customer_zip_code_prefix'].astype(str)
df_cleaned.describe()
| recency | frequency | mean_delivery_delay | monetary | average_basket_amount | review_score_class | geolocation_lat | geolocation_lng | average_basket | agro_industry_and_commerce | air_conditioning | art | arts_and_craftmanship | audio | auto | baby | bed_bath_table | books_general_interest | books_imported | books_technical | category_not_defined | cds_dvds_musicals | christmas_supplies | cine_photo | computers | computers_accessories | consoles_games | construction_tools_construction | construction_tools_lights | construction_tools_safety | cool_stuff | costruction_tools_garden | costruction_tools_tools | diapers_and_hygiene | drinks | dvds_blu_ray | electronics | fashio_female_clothing | fashion_bags_accessories | fashion_childrens_clothes | fashion_male_clothing | fashion_shoes | fashion_sport | fashion_underwear_beach | fixed_telephony | flowers | food | food_drink | furniture_bedroom | furniture_decor | furniture_living_room | furniture_mattress_and_upholstery | garden_tools | health_beauty | home_appliances | home_appliances_2 | home_comfort_2 | home_confort | home_construction | housewares | industry_commerce_and_business | kitchen_dining_laundry_garden_furniture | la_cuisine | luggage_accessories | market_place | music | musical_instruments | office_furniture | party_supplies | perfumery | pet_shop | security_and_services | signaling_and_security | small_appliances | small_appliances_home_oven_and_coffee | sports_leisure | stationery | tablets_printing_image | telephony | toys | watches_gifts | total_items | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 | 93042.000000 |
| mean | 285.299187 | 1.033361 | -11.086532 | 165.035533 | 160.194016 | 0.570828 | -21.183221 | -46.190129 | 1.138650 | 0.003332 | 0.004245 | 0.002246 | 0.000301 | 0.004256 | 0.053739 | 0.039380 | 0.171611 | 0.006943 | 0.000774 | 0.003020 | 0.022495 | 0.000193 | 0.002665 | 0.001021 | 0.002805 | 0.120698 | 0.013854 | 0.017540 | 0.005868 | 0.003063 | 0.045904 | 0.003805 | 0.001236 | 0.000967 | 0.007566 | 0.000795 | 0.036016 | 0.000871 | 0.026235 | 0.000097 | 0.001827 | 0.003396 | 0.000462 | 0.001709 | 0.004686 | 0.000516 | 0.007373 | 0.004600 | 0.001440 | 0.154188 | 0.008931 | 0.000398 | 0.078126 | 0.130694 | 0.008244 | 0.002579 | 0.000494 | 0.006384 | 0.010845 | 0.116625 | 0.003687 | 0.003934 | 0.000193 | 0.013177 | 0.004310 | 0.000430 | 0.008179 | 0.035866 | 0.000548 | 0.043260 | 0.027762 | 0.000021 | 0.004450 | 0.008545 | 0.000806 | 0.117549 | 0.032555 | 0.001010 | 0.054878 | 0.049902 | 0.074203 | 1.632327 |
| std | 150.947878 | 0.208854 | 10.043454 | 225.680792 | 219.376385 | 0.817144 | 5.602494 | 4.051123 | 0.514895 | 0.175539 | 0.186763 | 0.052304 | 0.022711 | 0.081915 | 0.430999 | 0.581746 | 1.104005 | 0.141067 | 0.041462 | 0.063921 | 0.345400 | 0.021246 | 0.182809 | 0.067896 | 0.102435 | 1.073750 | 0.236390 | 0.536832 | 0.256362 | 0.170102 | 0.386114 | 0.136385 | 0.044814 | 0.173226 | 0.428139 | 0.041720 | 0.368559 | 0.100884 | 0.326773 | 0.011820 | 0.079881 | 0.112852 | 0.055538 | 0.094491 | 0.286067 | 0.038787 | 0.253413 | 0.184991 | 0.061666 | 1.455265 | 0.407177 | 0.019938 | 1.277852 | 1.618400 | 0.102666 | 0.057291 | 0.042993 | 0.152585 | 0.315955 | 1.374851 | 0.110582 | 0.128594 | 0.019115 | 0.169777 | 0.170769 | 0.021742 | 0.137879 | 0.872422 | 0.032943 | 0.408144 | 0.373613 | 0.004636 | 0.255761 | 0.174102 | 0.030568 | 0.716340 | 0.330166 | 0.039871 | 0.394661 | 0.372720 | 0.551803 | 3.688936 |
| min | 49.000000 | 1.000000 | -146.016123 | 9.590000 | 9.590000 | -1.000000 | -33.689948 | -72.668881 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 163.000000 | 1.000000 | -16.208915 | 63.012500 | 62.358750 | 1.000000 | -23.589413 | -48.119200 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 50% | 267.000000 | 1.000000 | -11.618119 | 107.780000 | 105.620000 | 1.000000 | -22.922759 | -46.633601 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 394.000000 | 1.000000 | -6.386516 | 182.460000 | 176.525000 | 1.000000 | -20.077863 | -43.629511 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 650.000000 | 15.000000 | 188.975081 | 13664.080000 | 13664.080000 | 1.000000 | 3.842508 | -34.799347 | 21.000000 | 36.000000 | 49.000000 | 4.000000 | 4.000000 | 9.000000 | 36.000000 | 144.000000 | 125.000000 | 16.000000 | 4.000000 | 5.000000 | 40.000000 | 4.000000 | 36.000000 | 16.000000 | 16.000000 | 100.000000 | 36.000000 | 100.000000 | 36.000000 | 36.000000 | 49.000000 | 25.000000 | 4.000000 | 50.000000 | 100.000000 | 6.000000 | 48.000000 | 25.000000 | 36.000000 | 2.000000 | 9.000000 | 25.000000 | 16.000000 | 25.000000 | 64.000000 | 7.000000 | 61.000000 | 25.000000 | 9.000000 | 234.000000 | 100.000000 | 1.000000 | 225.000000 | 441.000000 | 4.000000 | 4.000000 | 9.000000 | 25.000000 | 36.000000 | 144.000000 | 16.000000 | 25.000000 | 4.000000 | 25.000000 | 36.000000 | 2.000000 | 16.000000 | 137.000000 | 4.000000 | 36.000000 | 49.000000 | 1.000000 | 36.000000 | 36.000000 | 4.000000 | 49.000000 | 36.000000 | 4.000000 | 50.000000 | 36.000000 | 72.000000 | 441.000000 |
# Distribution of the numeric variables (excluding the product-category columns)
num_col = df_cleaned.select_dtypes(include='number').columns.tolist()
dist_features = [col for col in num_col if col not in product_category]
plot_kde(df_cleaned, 4, dist_features)
Corrélation entre les variables¶
# Pearson correlation coefficients displayed as a heatmap
plt.figure(figsize=(15,10))
corr = round(df_cleaned[dist_features].corr(method='pearson'), 2)
mask = np.triu(np.ones_like(corr, dtype=bool))
heatmap = sns.heatmap(corr,
                      vmin=-1,
                      vmax=1,
                      annot=True,
                      cmap='coolwarm',
                      mask=mask)
heatmap.set_title('Correlation Heatmap (Pearson coefficient)')
plt.show()
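Strongly correlated pairs (such as monetary and average_basket_amount, which coincide for the many single-purchase customers) can also be listed programmatically, reusing corr and the upper-triangle trick; a sketch:
# List variable pairs whose |Pearson r| exceeds a threshold
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep strict upper triangle
high_pairs = upper.stack().loc[lambda s: s.abs() > 0.8]
print(high_pairs.sort_values(ascending=False))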
Essais de transformation pour normalisation des distributions¶
# Log transform of 'frequency', 'monetary', 'average_basket_amount', 'average_basket', 'total_items'
# (all strictly positive, so log10 is well defined)
log_transf = ['frequency', 'monetary', 'average_basket_amount', 'average_basket', 'total_items']
log_transf_df = np.log10(df_cleaned[log_transf])
log_transf_df.rename(columns={col: f'log_{col}' for col in log_transf_df.columns}, inplace=True)
plot_kde(log_transf_df, 2, log_transf_df.columns)
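As an alternative to a fixed log10, scikit-learn's PowerTransformer fits a Box-Cox exponent per variable (Box-Cox requires strictly positive inputs, which holds for these five columns). A sketch, reusing log_transf and the notebook's plot_kde helper:
from sklearn.preprocessing import PowerTransformer

# Per-variable Box-Cox transform as a comparison point to the log10 transform above
pt = PowerTransformer(method='box-cox')
boxcox_df = pd.DataFrame(pt.fit_transform(df_cleaned[log_transf]),
                         columns=[f'boxcox_{col}' for col in log_transf],
                         index=df_cleaned.index)
plot_kde(boxcox_df, 2, boxcox_df.columns)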
Sauvegarde du jeu de données¶
# Export df_cleaned for the clustering notebook
df_cleaned.to_pickle('df_cleaned.pkl')
Conclusions et perspectives¶
Conclusions¶
Several variables were selected for the customer segmentation:
- RFM variables (Recency, Frequency, Monetary)
- review_score_class
- mean_delivery_delay
- average_basket_amount, average_basket
- customer_zip_code_prefix, customer_city, customer_state
- geolocation_lat, geolocation_lng
- items per product category
- total_items
Only 3% of the customers in the database made more than one purchase.
The order address appears to reflect the customer's location at the time of the order, which is why the location of the most recent order was kept.
Perspectives¶
- Analyze the reviews in more depth: comment length, sentiment analysis, reviews across the different purchase phases, reviews per seller
- Analyze delivery delays and seller locations
- Analyze consumption peaks
- Obtain more data: customer age, type of neighborhood (affluent or not) inferred from the location